Contents

1 Install required packages

Install bigWig and latticeExtra package

install.packages("devtools", quiet = TRUE)
## also installing the dependencies 'Rcpp', 'utf8', 'askpass', 'credentials', 'openssl', 'sys', 'zip', 'gitcreds', 'httr2', 'ini', 'httpuv', 'xtable', 'sourcetools', 'later', 'promises', 'fansi', 'systemfonts', 'textshaping', 'pillar', 'pkgconfig', 'diffobj', 'rematch2', 'clipr', 'crayon', 'curl', 'gert', 'gh', 'purrr', 'rprojroot', 'rstudioapi', 'whisker', 'shiny', 'callr', 'processx', 'downlit', 'httr', 'ragg', 'tibble', 'xml2', 'htmlwidgets', 'prettyunits', 'xopen', 'brew', 'commonmark', 'cpp11', 'brio', 'praise', 'ps', 'waldo', 'usethis', 'desc', 'miniUI', 'pkgbuild', 'pkgdown', 'pkgload', 'profvis', 'rcmdcheck', 'remotes', 'roxygen2', 'rversions', 'sessioninfo', 'testthat', 'urlchecker', 'withr'
library(devtools)
## Loading required package: usethis
devtools::install_github('andrelmartins/bigWig',
              subdir='bigWig')
## Downloading GitHub repo andrelmartins/bigWig@HEAD
## ── R CMD build ─────────────────────────────────────────────────────────────────
## * checking for file ‘/private/tmp/RtmpGFtwLg/remotes13d448eac5f3/andrelmartins-bigWig-beac5a7/bigWig/DESCRIPTION’ ... OK
## * preparing ‘bigWig’:
## * checking DESCRIPTION meta-information ... OK
## * cleaning src
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * building ‘bigWig_0.2-9.tar.gz’
library(bigWig)

install.packages("latticeExtra", quiet = TRUE)
## also installing the dependencies 'deldir', 'RcppEigen', 'png', 'jpeg', 'RColorBrewer', 'interp'
install.packages("DESeq2", quiet = TRUE)
## Warning: package 'DESeq2' is not available for this version of R
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

Install bedtools

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install bedtools

2 Dec 1st

2.1 Goal:

In previous analysis, we have showed that using a subset of GATA3-peak with known enriched GATA3 motif (GAT—-ATC), we can successfully generate a composite profile with expected GAT to ATC 3mer distribution. –This is answering if we could build a pipeline to successfully demonstrate a known 3mer-3mer structure (two 3mer sites with fixed number of spacing in between).

With this analysis pipeline in hand, the real question we want to ask is:

1) Is our GATA3 ChIP-seq library has predominant enrichment of such 3mer-3mer (GAT-GAT or GAT-ATC) structure compared to any other library that are not ChIPed by GATA3?

2) Is this GAT-GAT or GAT-ATC structure more common to be observed than any other 3mer combinations?

2.2 GATA3 ChIP peak without MEME/STREME found GATA3 motif

Background: we have GATA3 peak that have exhaustively searched for GATA3-like motif with MEME/STREME software. However, there are ~40% peak failed in finding a GATA3-like motif. Even for peak subset that have relatively high peak intensity (top5%), we still have ~20% peaks that MEME/STREME cannot find a GATA3-like motif.

Generally, we would expect to find binding sequences within the peak region, so that the ChIPed transcription factor (in our case, GATA3) can bind to. This is even more true for peaks with relatively high intensity.

Our question now became, is MEME/STREME limited in finding binding sequences patterned like GATA3 (that has fixed spacings between two 3mer)? Can we find the 3mer-3mer sequences in GATA3 peaks that do not contain the MEME/STREME found GATA3-like motifs?

peaks without 9 GATA3-motif (that found by MEME/STREME)

wc -l without_motifs_123456_789.bed
head -5 without_motifs_123456_789.bed
##    38980 without_motifs_123456_789.bed
## chr1 827280  827481  GATA_ChIP_peak_28   9.30478
## chr1 916669  916870  GATA_ChIP_peak_31   7.79887
## chr1 924753  924954  GATA_ChIP_peak_33   3.78065
## chr1 966553  966754  GATA_ChIP_peak_34   3.78065
## chr1 999408  999609  GATA_ChIP_peak_36   2.11515

Peak Intensity
In the above file, the last column is the peak intensity reported by MACS3.
Notice that a peak with high MACS3-intensity is not necessarily a intense peak. There are two things we need to consider regarding peak intensities. The first is peak region; the second is dynamic range. Imaging a peak that is very intense but within a narrow range, the other peak is not so intense but can span its signals across a very long distance. We need to consider this region differences while calling any peak to be intense or not.

First use Sam Flag 0x3(reads paired and mapped in proper pair) to calculate the read depth for each GATA library.

# calculate the size factors 
module load samtools/1.12

dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/
for i in GATA
do
  echo $i
  > ${i}_header.txt
  > ${i}_reads.txt
  for j in ${dir}MCF7_dTAGGATA522*_${i}_*.sorted.bam
  do
    echo $j
    name=$(echo $j | awk -F $dir '{print $2}' | awk -F".sorted.bam" '{print $1}')
    echo $name | paste ${i}_header.txt - > ${i}_tmp.txt 
    mv ${i}_tmp.txt ${i}_header.txt
    reads=`samtools view -c -f 0x3 $j` #count the reads paired and mapped in proper pair
    echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt 
    mv ${i}_tmp.txt ${i}_reads.txt
  done  
  cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
  mv ${i}_tmp.txt ${i}_reads.txt
  rm ${i}_header.txt
done 
cat GATA_reads.txt
##  MCF7_dTAGGATA522_GATA_CC_rep1   MCF7_dTAGGATA522_GATA_CC_rep2   MCF7_dTAGGATA522_GATA_CC_rep3   MCF7_dTAGGATA522_GATA_CE_rep1   MCF7_dTAGGATA522_GATA_CE_rep2   MCF7_dTAGGATA522_GATA_CE_rep3   MCF7_dTAGGATA522_GATA_dE_rep1   MCF7_dTAGGATA522_GATA_dE_rep2   MCF7_dTAGGATA522_GATA_dE_rep3
##  33948616    32585396    34475586    51112588    147834968   136838760   34136142    136271358   85665512

bargraph

library(lattice)
df=as.data.frame(t(read.table("GATA_reads.txt", header=F)))
colnames(df)=c("library", "aligned_reads")
df$libraey=as.factor(df$library)
df$aligned_reads=as.numeric((df$aligned_reads))

barchart(aligned_reads ~ library, 
         data = df,
         ylim=c(0, max(df$aligned_reads)*1.04),
         col = "skyblue",
         scales = list(x = list(rot = 45)),
         xlab = "GATA3 ChIP library", 
         ylab = "concordantly aligned reads"
         )

This bar graph represents the read depth in each GATA3 ChIP-seq library. In the downstream analysis, we will need to use DESeq2 to normalize the counts in each library with size factor to account for this read depth difference.  

Get peak intensity within 400bp window withDeseq2 for peaks without MEME/STREME found motifs 12345678.

load peaks without motif 12345678.

a =  read.table('without_motifs_123456_789.bed', sep = "\t", header=FALSE) 
nrow(a)
## [1] 38980
head(a)
##     V1      V2      V3                V4      V5
## 1 chr1  827280  827481 GATA_ChIP_peak_28 9.30478
## 2 chr1  916669  916870 GATA_ChIP_peak_31 7.79887
## 3 chr1  924753  924954 GATA_ChIP_peak_33 3.78065
## 4 chr1  966553  966754 GATA_ChIP_peak_34 3.78065
## 5 chr1  999408  999609 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000436 1000637 GATA_ChIP_peak_37 5.52004

Increase the width to 400bp window.

library(bigWig)
peak.region.400win=center.bed(a, upstreamWindow = 200, downstreamWindow = 200)
nrow(peak.region.400win)
## [1] 38980
head(peak.region.400win)
##     V1      V2      V3                V4      V5
## 1 chr1  827180  827581 GATA_ChIP_peak_28 9.30478
## 2 chr1  916569  916970 GATA_ChIP_peak_31 7.79887
## 3 chr1  924653  925054 GATA_ChIP_peak_33 3.78065
## 4 chr1  966453  966854 GATA_ChIP_peak_34 3.78065
## 5 chr1  999308  999709 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000336 1000737 GATA_ChIP_peak_37 5.52004

In the below chunk, we define a function to get raw counts from each sample/reps. This function uses bed.region.bpQuery.bigWig from bigWig package. It requires a bed region file that has the peak region that we want to analysis (right now we want to analyse peak without motifs 123456789); it also requires non-normalized, raw bigWig files (generated directly from seqOutbias) to allow bed.region.bpQuery.bigWig to query the counts information from. The output from this function will be: rows will be the same as bed file, columns will be each bigWig library, entries will be the raw counts.

#functions on github
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')

#function
get.counts.interval <- function(df, path.to.bigWig, file.prefix = 'H') {
    vec.names = c()
    inten.df=data.frame(matrix(ncol = 0, nrow = nrow(df)))
    
    for (mod.bigWig in Sys.glob(file.path(path.to.bigWig, paste(file.prefix, "*.bigWig", sep ='')))) {
        factor.name = strsplit(strsplit(mod.bigWig, "/")[[1]][length(strsplit(mod.bigWig, "/")[[1]])], '.bigWig')[[1]][1]
        print(factor.name)
        vec.names = c(vec.names, factor.name)
        loaded.bw = load.bigWig(mod.bigWig)
        print(mod.bigWig)
        mod.inten = bed.region.bpQuery.bigWig(loaded.bw, df, abs.value = TRUE)
        inten.df = cbind(inten.df, mod.inten)
    }
    colnames(inten.df) = vec.names
    r.names = paste(df[,1], ':', df[,2], '-', df[,3], sep='')
    row.names(inten.df) = r.names
    return(inten.df)
}

We will first use the defined function to query raw counts from each non-normalized bigWig files of GATA ChIP-seq use peak region info loaded before (data frame “peak.region.400win”).

library(bigWig)
#non-normalized counts
GATA.counts.df= get.counts.interval(peak.region.400win, "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/bigWigs/Seqoutbias_bw","MCF") #25 libraries
#nrow(peak.region.400win)
#[1] 38980
#nrow(GATA.counts.df)
#[1] 38980
#head(GATA.counts.df)
#colnames(GATA.counts.df)


GATA.analysis.regions=GATA.counts.df[,grepl("_GATA_",colnames(GATA.counts.df))] # get non-normalized counts from "GATA" libraries
#colnames(GATA.analysis.regions)
#[1] "MCF7_dTAGGATA522_GATA_CC_rep1" "MCF7_dTAGGATA522_GATA_CC_rep2"
#[3] "MCF7_dTAGGATA522_GATA_CC_rep3" "MCF7_dTAGGATA522_GATA_CE_rep1"
#[5] "MCF7_dTAGGATA522_GATA_CE_rep2" "MCF7_dTAGGATA522_GATA_CE_rep3"
#[7] "MCF7_dTAGGATA522_GATA_dE_rep1" "MCF7_dTAGGATA522_GATA_dE_rep2"
#[9] "MCF7_dTAGGATA522_GATA_dE_rep3"
identical(rownames(GATA.analysis.regions),rownames(GATA.counts.df)) # [1] TRUE

Then we use the DESeq2 package to make a counts matrix (DESeqDataSetFromMatrix) and calculate size factors for each library (estimateSizeFactorsForMatrix) use the previously calculated read depth for each library (“GATA_reads.txt”); We use this size factor to normalize the counts (sizeFactors).
Then we use rowMeans to average the normalized counts for the three GATA_CC reps, and save it as “peak.intensities”; this normalized counts now can be use to determine if a peak is intense or not.

library(DESeq2)
sample.conditions = factor(sapply(strsplit(colnames(GATA.analysis.regions), '_rep'), '[', 1))
#[1] MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC
#[4] MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE
#[7] MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE
#3 Levels: MCF7_dTAGGATA522_GATA_CC ... MCF7_dTAGGATA522_GATA_dE
deseq.counts.table = DESeqDataSetFromMatrix(countData = GATA.analysis.regions, # DESeqDataSet needs countData to be non-negative integers; non-normalized counts are integer, normalized signals has decimals.
                colData = as.data.frame(sample.conditions),
                design = ~ sample.conditions)


GATA.SF <- read.table("GATA_reads.txt", sep = '\t', header = TRUE)[,-1] # GATA size factors from read depth
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
#                      33948616                      32585396
#  MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1
#                      34475586                      51112588
#  MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3
#                     147834968                     136838760
#  MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2
#                      34136142                     136271358
#  MCF7_dTAGGATA522_GATA_dE_rep3
#                      85665512
GATA.size.factors = estimateSizeFactorsForMatrix(GATA.SF) # read depth transformed to size factor 
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2 
#                    0.5385594                     0.5169334 
#MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1 
#                    0.5469193                     0.8108480 
#MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3 
#                    2.3452478                     2.1708044 
#MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2 
#                    0.5415343                     2.1618032 
#MCF7_dTAGGATA522_GATA_dE_rep3 
#                    1.3589941

                                       
sizeFactors(deseq.counts.table) <- GATA.size.factors # assign to each column of the count matrix (deseq.counts.table) the size factor to bring each column to a common scale
dds <- DESeq(deseq.counts.table)
normalized.counts.GATA3 = counts(dds, normalized=TRUE)
head(normalized.counts.GATA3)
peak.intensities = rowMeans(normalized.counts.GATA3[,1:3]) # we want to get the average read counts for CC groups
names(peak.intensities) = rownames(normalized.counts.GATA3)
save.image('231205_GATA3_ChIP_deseq.Rdata') 

Subset peaks without the 8 GATA3-motifs, in top20% intensity quantile

We can load the saved Rdata and look at each dataframe.

load('231205_GATA3_ChIP_deseq.Rdata')
## Warning: namespace 'DESeq2' is not available and has been replaced
## by .GlobalEnv when processing object 'dds'

Read this .bed file into R, and use DeSeq2 to count read size and parse into different quantile

head(peak.intensities)
##   chr1:827180-827581   chr1:916569-916970   chr1:924653-925054 
##             45.32469             36.69993             30.44684 
##   chr1:966453-966854   chr1:999308-999709 chr1:1000336-1000737 
##             26.87016             21.22455             24.34263
quantile(peak.intensities, probs = seq(.25, 1.00, by = .25))
##        25%        50%        75%       100% 
##   34.13704   46.04222   71.43499 6270.21535
quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))
##        20%        40%        60%        80%       100% 
##   31.94822   40.59628   52.89014   83.30172 6270.21535

violin plot

This violin plot is showing the distribution of peak intensity (log10).

library(lattice)
log10quantiles <- quantile(log(abs(peak.intensities), base = 10), probs = seq(.20, 1.00, by = .20))

png('violinplot_GATA3_ChIP_peak_normalized_intensity.png')
print(bwplot(log(abs(peak.intensities), base = 10) ~ factor("1"), 
       main = "Violin-Like Plot",
       panel = function(x, ...) {
         panel.violin(x, ...)
         panel.abline(h = log10quantiles, col = "red", lty = 2)
       },
       xlab = "", ylab = "log10 normalized Intensity")
)
normalized GATA3 ChIP peak intensity

Figure 1: normalized GATA3 ChIP peak intensity

This violin plot represents the distribution and probability density of (log10) normalized peak intensity (for peak without motif 1~9). Each red dotted line indicates the quantile cutoff of 20%, which can help us to visualize the peak intensity distribution across different quantile.  

Parse peaks into 5 intensity quantile by 20%.

chr = sapply(strsplit(names(peak.intensities), ":"), "[", 1)
rnge = sapply(strsplit(names(peak.intensities), ":"), "[", 2)
start = as.numeric(sapply(strsplit(rnge, "-"), "[", 1)) + 200
end = as.numeric(sapply(strsplit(rnge, "-"), "[", 2)) - 200

quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))
##        20%        40%        60%        80%       100% 
##   31.94822   40.59628   52.89014   83.30172 6270.21535
# 1bp summit quantile file
j =0 
q=seq(.20, 1.00, by = .20)
count=0
for (i in quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))){
count = count +1

write.table(file = paste0('quantile', as.character(q[count]), '_summits.bed'), data.frame(chr[peak.intensities > j & peak.intensities <= i], start[peak.intensities > j & peak.intensities <= i], end[peak.intensities > j & peak.intensities <= i], peak.intensities[peak.intensities > j & peak.intensities <= i]), sep = '\t', quote=FALSE, col.names=FALSE, row.names=FALSE )
j = i
}
for i in *quantile*.bed
do
wc -l $i
done
##     7594 all.distance.quantile0.2.bed
##     7796 all.distance.quantile0.4.bed
##     7796 all.distance.quantile0.6.bed
##     7796 all.distance.quantile0.8.bed
##     7796 all.distance.quantile1.bed
##     7563 closest.2nd.distance.quantile0.2.bed
##     7781 closest.2nd.distance.quantile0.4.bed
##     7785 closest.2nd.distance.quantile0.6.bed
##     7781 closest.2nd.distance.quantile0.8.bed
##     7791 closest.2nd.distance.quantile1.bed
##     7594 closest_1st_GAT_to_GATA_peak_quantile0.2.bed
##     7796 closest_1st_GAT_to_GATA_peak_quantile0.4.bed
##     7796 closest_1st_GAT_to_GATA_peak_quantile0.6.bed
##     7796 closest_1st_GAT_to_GATA_peak_quantile0.8.bed
##     7796 closest_1st_GAT_to_GATA_peak_quantile1.bed
##     7563 closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.bed
##     7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.bed
##     7785 closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.bed
##     7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.bed
##     7791 closest_1st_GAT_to_GATA_uniq_peak_quantile1.bed
##     7569 closest_1st_GAT_to_localctrl_200_uniq_quantile0.2.sorted.bed
##     7791 closest_1st_GAT_to_localctrl_200_uniq_quantile0.4.sorted.bed
##     7784 closest_1st_GAT_to_localctrl_200_uniq_quantile0.6.sorted.bed
##     7781 closest_1st_GAT_to_localctrl_200_uniq_quantile0.8.sorted.bed
##     7792 closest_1st_GAT_to_localctrl_200_uniq_quantile1.sorted.bed
##     7594 localctrl.200.distance.quantile0.2.bed
##     7796 localctrl.200.distance.quantile0.4.bed
##     7796 localctrl.200.distance.quantile0.6.bed
##     7796 localctrl.200.distance.quantile0.8.bed
##     7796 localctrl.200.distance.quantile1.bed
##     7594 quantile0.2_summits.200bpctrl.bed
##     7594 quantile0.2_summits.bed
##     7796 quantile0.4_summits.200bpctrl.bed
##     7796 quantile0.4_summits.bed
##     7796 quantile0.6_summits.200bpctrl.bed
##     7796 quantile0.6_summits.bed
##     7796 quantile0.8_summits.200bpctrl.bed
##     7796 quantile0.8_summits.bed
##     7796 quantile1_summits.200bpctrl.bed
##     7796 quantile1_summits.bed

Now working on the top 20% quantile peaks:

wc -l quantile1_summits.bed
head quantile1_summits.bed

2.2.1 Find closest GAT to peak

In this section, we use bedtools closestBed (refer to: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to find the closest GAT to each provided peak summit.
Input:
-a is the sorted peak summit file (centered 1bp);
-b is the sorted, and concatenated GAT coordinates file (both plus and minus);

Input1 -a: peak: quantile1_summits.bed : peaks without motif 123456789, top20% intensity

dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/overrep_3mer/peak_without_123456789/
head ${dir}quantile1_summits.bed

Input2 -b: GAT coordinates on full hg38 use read size==30

dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs30/seqdump/
head ${dir}hg38.3.3.3minus.14_GAT.bed  
head ${dir}hg38.3.3.3plus.36_GAT.bed

head ${dir}hg38.3.3.3minus.36_ATC.bed  
head ${dir}hg38.3.3.3plus.14_ATC.bed

These files are very large, here are the headed version:

head head2000_hg38.3.3.3plus.36_GAT.bed
echo ""
head head2000_hg38.3.3.3minus.14_GAT.bed
echo ""
head head2000_hg38.3.3.3plus.14_ATC.bed
echo ""
head head2000_hg38.3.3.3minus.36_ATC.bed
## chr1 11521   11524   36  36  +   GAT
## chr1 12188   12191   36  36  +   GAT
## chr1 13274   13277   36  36  +   GAT
## chr1 13314   13317   36  36  +   GAT
## chr1 15171   15174   36  36  +   GAT
## chr1 15196   15199   36  36  +   GAT
## chr1 15883   15886   36  36  +   GAT
## chr1 16274   16277   36  36  +   GAT
## chr1 33401   33404   36  36  +   GAT
## chr1 39192   39195   36  36  +   GAT
## 
## chr1 11145   11148   14  14  -   GAT
## chr1 11160   11163   14  14  -   GAT
## chr1 11576   11579   14  14  -   GAT
## chr1 13315   13318   14  14  -   GAT
## chr1 19797   19800   14  14  -   GAT
## chr1 27023   27026   14  14  -   GAT
## chr1 28605   28608   14  14  -   GAT
## chr1 30925   30928   14  14  -   GAT
## chr1 47300   47303   14  14  -   GAT
## chr1 47306   47309   14  14  -   GAT
## 
## chr1 13315   13318   14  14  +   ATC
## chr1 15172   15175   14  14  +   ATC
## chr1 19751   19754   14  14  +   ATC
## chr1 30902   30905   14  14  +   ATC
## chr1 38888   38891   14  14  +   ATC
## chr1 39150   39153   14  14  +   ATC
## chr1 39215   39218   14  14  +   ATC
## chr1 41955   41958   14  14  +   ATC
## chr1 47094   47097   14  14  +   ATC
## chr1 47641   47644   14  14  +   ATC
## 
## chr1 11159   11162   36  36  -   ATC
## chr1 11462   11465   36  36  -   ATC
## chr1 11569   11572   36  36  -   ATC
## chr1 13314   13317   36  36  -   ATC
## chr1 27022   27025   36  36  -   ATC
## chr1 40266   40269   36  36  -   ATC
## chr1 41985   41988   36  36  -   ATC
## chr1 47682   47685   36  36  -   ATC
## chr1 47780   47783   36  36  -   ATC
## chr1 49295   49298   36  36  -   ATC

we will concatanate the plus and minus 3mer file together.

cat ${dir}hg38.3.3.3minus.14_GAT.bed ${dir}hg38.3.3.3plus.36_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.bed
cat ${dir}hg38.3.3.3minus.36_ATC.bed ${dir}hg38.3.3.3plus.14_ATC.bed > hg38.3.3.3.30_plus_minus_ATC.bed

wc -l ${dir}hg38.3.3.3minus.14_GAT.bed #32096009
wc -l ${dir}hg38.3.3.3plus.36_GAT.bed #32147038
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #64243047
echo ""
wc -l ${dir}hg38.3.3.3minus.36_ATC.bed #32081985
wc -l ${dir}hg38.3.3.3plus.14_ATC.bed #32035017
wc -l hg38.3.3.3.30_plus_minus_ATC.bed #64117002

load files

load files contains selected ChIP peak summit info

chip.peak.summit=read.table("quantile1_summits.bed", header=FALSE)
nrow(chip.peak.summit)
## [1] 7796
head(chip.peak.summit)
##     V1      V2      V3        V4
## 1 chr1 5598187 5598188 271.93122
## 2 chr1 8017660 8017661 676.31362
## 3 chr1 8020178 8020179 324.95549
## 4 chr1 8055212 8055213 111.55874
## 5 chr1 8061475 8061476  98.51927
## 6 chr1 8120791 8120792 124.49564

load files contains 3mer coordinates info

#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
#    V1    V2    V3 V4 V5 V6  V7
#1 chr1 11145 11148 14 14  - GAT
#2 chr1 11160 11163 14 14  - GAT
#3 chr1 11576 11579 14 14  - GAT
#4 chr1 13315 13318 14 14  - GAT
#5 chr1 19797 19800 14 14  - GAT
#6 chr1 27023 27026 14 14  - GAT
tail(all.GAT.file)
#                             V1    V2    V3 V4 V5 V6  V7
#64243042 chrY_KI270740v1_random 33028 33031 36 36  + GAT
#64243043 chrY_KI270740v1_random 35533 35536 36 36  + GAT
#64243044 chrY_KI270740v1_random 35543 35546 36 36  + GAT
#64243045 chrY_KI270740v1_random 35912 35915 36 36  + GAT
#64243046 chrY_KI270740v1_random 36540 36543 36 36  + GAT
#64243047 chrY_KI270740v1_random 36913 36916 36 36  + GAT

nrow(all.GAT.file)
#[1] 64243047

bedtools closestBed
The below function will sort input bed1 and bed2 first, then run bedtools closestBed between bed1 and bed2.

# define function 
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
  
  options(scipen =99) # not use scientific notation when writing out
  
  #write bed formatted data.frames to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  # the command sort a and b file by coordinates
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  # the command call closestBed on bed1 and bed2
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='')
  
  # remove intermediate files
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
  return(res)
}

Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.

Find the closest GAT to peak summit regardless of GAT strandedness

all.distance=bedTools.closest(bed1 = chip.peak.summit[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(all.distance,file= 'all.distance.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231205_GATA3_ChIP_clsestbed.Rdata') 

In this “all.distance’ data frame, V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.

all.distance=read.table('all.distance.bed',header=F, comment.char='')
head(all.distance)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4

coherence check 1: the number of the closest minus GAT is comparable to the number of the closest plus GAT.

nrow(all.distance)
## [1] 7796
nrow(all.distance[all.distance$V9=="+",])
## [1] 3873
nrow(all.distance[all.distance$V9=="-",])
## [1] 3923

coherence check 2: Since we have use -t first to take care of the tie, now we have equal number of unique peak with their closest plus or minus or both GAT.

nrow(chip.peak.summit) # there are 8571 unique peak 
## [1] 7796
nrow(all.distance)  
## [1] 7796

2.2.2 making composite profile

2.3 A unioin peak set as control

To answer the first big question, we also need to prepare a control peak set. We want to know the 3mer distribution of GAT/ATC in this random peak set, and compare it with our GATA3 ChIP peaks.

Here we are using the ENCODE union DNase HS sites:
Reference: Amaral, M. L., Erikson, G. A., & Shokhirev, M. N. (2018). BART: bioinformatics array research tool. BMC bioinformatics, 19(1), 1-6.
These union DHS regions represent the whole repertoire of regulatory elements in the human genome. These are regions of chromatin that are sensitive to cleavage by the DNase.

remove GATA3 motifs with MAST

The full .bed file contains 2723010 regulatory regions. To make this file a proper control for our peak set, we use MAST to remove regulatory regions that contains GATA3-like motif 123456789 (use same p-value stringency).

We have 9 motifs that found by MEME/STREME software.

ls GATA3_MEME_STREME_motifs/
#cat GATA3_MEME_STREME_motifs/meme_1.txt
## AGATAAM_streme.txt
## AGATDNHATCT_streme.txt
## AGATNDWNAGATARN_meme.txt_meme_4.txt
## BTTATCWGATB_meme_5.txt_meme.txt
## TGATAA_streme.txt
## WGATBDHRVAGATAA_meme.txt_meme_6.txt
## meme_1.txt
## meme_2.txt
## meme_3.txt

First convert regulatory region file from .bed to .fasta.

module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa

fastaFromBed -fi $genome -bed hg38_unionDHS_fc4_50merge.bed -fo hg38_unionDHS_fc4_50merge.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa

#round1
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_1.txt hg38_unionDHS_fc4_50merge.fasta > mast_GATA3_PSWM_in_unionDHS_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round1.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round1.bed #34324
intersectBed -v -a hg38_unionDHS_fc4_50merge.bed -b mast_GATA3_PSWM_in_unionDHS_round1.bed > unionDHS_without_motifs_1.bed
wc -l unionDHS_without_motifs_1.bed #2688686

#round2
fastaFromBed -fi $genome -bed unionDHS_without_motifs_1.bed -fo unionDHS_without_motifs_1.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_2.txt unionDHS_without_motifs_1.fasta > mast_GATA3_PSWM_in_unionDHS_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round2.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round2.bed #38803
intersectBed -v -a unionDHS_without_motifs_1.bed -b mast_GATA3_PSWM_in_unionDHS_round2.bed > unionDHS_without_motifs_12.bed
wc -l unionDHS_without_motifs_12.bed #2649883

#round3
fastaFromBed -fi $genome -bed unionDHS_without_motifs_12.bed -fo unionDHS_without_motifs_12.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_3.txt unionDHS_without_motifs_12.fasta > mast_GATA3_PSWM_in_unionDHS_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round3.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round3.bed #35361
intersectBed -v -a unionDHS_without_motifs_12.bed -b mast_GATA3_PSWM_in_unionDHS_round3.bed > unionDHS_without_motifs_123.bed
wc -l unionDHS_without_motifs_123.bed #2614522

#round4
fastaFromBed -fi $genome -bed unionDHS_without_motifs_123.bed -fo unionDHS_without_motifs_123.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/AGATNDWNAGATARN_meme.txt_meme_4.txt unionDHS_without_motifs_123.fasta > mast_GATA3_PSWM_in_unionDHS_round4.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round4.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round4.bed #40994
intersectBed -v -a unionDHS_without_motifs_123.bed -b mast_GATA3_PSWM_in_unionDHS_round4.bed > unionDHS_without_motifs_1234.bed
wc -l unionDHS_without_motifs_1234.bed #2573528

#round5
fastaFromBed -fi $genome -bed unionDHS_without_motifs_1234.bed -fo unionDHS_without_motifs_1234.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/BTTATCWGATB_meme_5.txt_meme.txt unionDHS_without_motifs_1234.fasta > mast_GATA3_PSWM_in_unionDHS_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round5.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round5.bed #29331
intersectBed -v -a unionDHS_without_motifs_1234.bed -b mast_GATA3_PSWM_in_unionDHS_round5.bed > unionDHS_without_motifs_12345.bed
wc -l unionDHS_without_motifs_12345.bed #2544197


#round6
fastaFromBed -fi $genome -bed unionDHS_without_motifs_12345.bed -fo unionDHS_without_motifs_12345.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/WGATBDHRVAGATAA_meme.txt_meme_6.txt unionDHS_without_motifs_12345.fasta > mast_GATA3_PSWM_in_unionDHS_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round6.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round6.bed #42590 
intersectBed -v -a unionDHS_without_motifs_12345.bed -b mast_GATA3_PSWM_in_unionDHS_round6.bed > unionDHS_without_motifs_123456.bed 
wc -l unionDHS_without_motifs_123456.bed #2501607

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed unionDHS_without_motifs_123456.bed -fo unionDHS_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/AGATAAM_streme.txt unionDHS_without_motifs_123456.fasta > mast_GATA3_PSWM_in_unionDHS_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round7.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round7.bed #163724
intersectBed -v -a unionDHS_without_motifs_123456.bed -b mast_GATA3_PSWM_in_unionDHS_round7.bed > unionDHS_without_motifs_123456_7.bed
wc -l unionDHS_without_motifs_123456_7.bed #2337883


#round8
fastaFromBed -fi $genome -bed unionDHS_without_motifs_123456_7.bed -fo unionDHS_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/TGATAA_streme.txt unionDHS_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_unionDHS_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round8.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round8.bed #101810
intersectBed -v -a unionDHS_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_unionDHS_round8.bed > unionDHS_without_motifs_123456_78.bed 
wc -l unionDHS_without_motifs_123456_78.bed  #2236073

#round9
fastaFromBed -fi $genome -bed unionDHS_without_motifs_123456_78.bed -fo unionDHS_without_motifs_123456_78.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/AGATDNHATCT_streme.txt unionDHS_without_motifs_123456_78.fasta > mast_GATA3_PSWM_in_unionDHS_round9.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round9.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round9.bed #61162
intersectBed -v -a unionDHS_without_motifs_123456_78.bed -b mast_GATA3_PSWM_in_unionDHS_round9.bed > unionDHS_without_motifs_123456_789.bed
wc -l unionDHS_without_motifs_123456_789.bed #2174911

Randomly subset the union regulatory region
After the removal of the 9 GATA3 motifs with mast, the unionDHS peak set now remains 2174911 regions (originally 2723010, removed ~20% regions).

Here I am using shuf to randomly select 7800 regulatory region to match with the GATA peak subset (quantile1: 7796 peaks) I am going to compare with.

shuf -n 7800 unionDHS_without_motifs_123456_789.bed > random_7800_unionDHS_without_motifs_123456_789.bed 
wc -l random_7800_unionDHS_without_motifs_123456_789.bed  
head random_7800_unionDHS_without_motifs_123456_789.bed  
##     7800 random_7800_unionDHS_without_motifs_123456_789.bed
## chr1 58307349    58307399    68093   50  +
## chr10    80965538    80965588    1642740 50  +
## chr4 176763006   176763056   819823  50  +
## chr11    72552854    72553010    1763461 156 +
## chr1 68189168    68189373    79518   205 +
## chr5 24465079    24465129    852478  50  +
## chr19    17195481    17195531    2459669 50  +
## chr1 94664602    94664652    104394  50  +
## chr2 56036   56086   234875  50  +
## chr17    42150025    42150076    2322289 51  +

Read the file in and use the center as regulatory region summit:

library(bigWig)
chip.peak=read.table("random_7800_unionDHS_without_motifs_123456_789.bed", header=FALSE)
nrow(chip.peak)
## [1] 7800
head(chip.peak)
##      V1        V2        V3      V4  V5 V6
## 1  chr1  58307349  58307399   68093  50  +
## 2 chr10  80965538  80965588 1642740  50  +
## 3  chr4 176763006 176763056  819823  50  +
## 4 chr11  72552854  72553010 1763461 156  +
## 5  chr1  68189168  68189373   79518 205  +
## 6  chr5  24465079  24465129  852478  50  +
chip.peak.summit=center.bed(chip.peak, upstreamWindow=0, downstreamWindow=0)
nrow(chip.peak.summit)
## [1] 7800
head(chip.peak.summit)
##      V1        V2        V3      V4  V5 V6
## 1  chr1  58307374  58307375   68093  50  +
## 2 chr10  80965563  80965564 1642740  50  +
## 3  chr4 176763031 176763032  819823  50  +
## 4 chr11  72552932  72552933 1763461 156  +
## 5  chr1  68189270  68189271   79518 205  +
## 6  chr5  24465104  24465105  852478  50  +

find the closest GAT to summit of unionDHS with closestBed.

ctrl.distance=bedTools.closest(bed1 = chip.peak.summit[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(ctrl.distance,file= 'ctrl.distance.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231205_GATA3_ChIP_clsestbed2.Rdata') 

In this “ctrl.distance’ data frame, V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.

ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
head(ctrl.distance)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1  190928  190929 chr1  191000  191003 14 14  - GAT  72
## 2 chr1 1183557 1183558 chr1 1183537 1183540 14 14  - GAT  18
## 3 chr1 1207307 1207308 chr1 1207290 1207293 14 14  - GAT  15
## 4 chr1 1548961 1548962 chr1 1548964 1548967 36 36  + GAT   3
## 5 chr1 1666170 1666171 chr1 1666192 1666195 36 36  + GAT  22
## 6 chr1 2298957 2298958 chr1 2299000 2299003 36 36  + GAT  43
nrow(ctrl.distance)
## [1] 7800
nrow(ctrl.distance[ctrl.distance$V9=="+",])
## [1] 3894
nrow(ctrl.distance[ctrl.distance$V9=="-",])
## [1] 3906

2.3.1 Cumulative distribution function (CDF) plot

all.distance=read.table('all.distance.bed',header=F, comment.char='')
ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
head(all.distance)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
head(ctrl.distance)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1  190928  190929 chr1  191000  191003 14 14  - GAT  72
## 2 chr1 1183557 1183558 chr1 1183537 1183540 14 14  - GAT  18
## 3 chr1 1207307 1207308 chr1 1207290 1207293 14 14  - GAT  15
## 4 chr1 1548961 1548962 chr1 1548964 1548967 36 36  + GAT   3
## 5 chr1 1666170 1666171 chr1 1666192 1666195 36 36  + GAT  22
## 6 chr1 2298957 2298958 chr1 2299000 2299003 36 36  + GAT  43
nrow(all.distance)
## [1] 7796
nrow(ctrl.distance)
## [1] 7800
df.chip = cbind(all.distance[,c(1:3, 11)], "GATA3_peak")
df.ctrl = cbind(ctrl.distance[,c(1:3, 11)], "ctrl")
colnames(df.chip) = c(colnames(all.distance)[1:3], "dis", "status")
colnames(df.ctrl) = c(colnames(ctrl.distance)[1:3], "dis", "status")
head(df.chip)
##     V1      V2      V3 dis     status
## 1 chr1 5598187 5598188  21 GATA3_peak
## 2 chr1 8017660 8017661   3 GATA3_peak
## 3 chr1 8020178 8020179   6 GATA3_peak
## 4 chr1 8055212 8055213   8 GATA3_peak
## 5 chr1 8061475 8061476  32 GATA3_peak
## 6 chr1 8120791 8120792   4 GATA3_peak
head(df.ctrl)
##     V1      V2      V3 dis status
## 1 chr1  190928  190929  72   ctrl
## 2 chr1 1183557 1183558  18   ctrl
## 3 chr1 1207307 1207308  15   ctrl
## 4 chr1 1548961 1548962   3   ctrl
## 5 chr1 1666170 1666171  22   ctrl
## 6 chr1 2298957 2298958  43   ctrl
df.all = rbind(df.chip, df.ctrl)
df.all$status = factor(df.all$status, levels = c("GATA3_peak", "ctrl"))
head(df.all)
##     V1      V2      V3 dis     status
## 1 chr1 5598187 5598188  21 GATA3_peak
## 2 chr1 8017660 8017661   3 GATA3_peak
## 3 chr1 8020178 8020179   6 GATA3_peak
## 4 chr1 8055212 8055213   8 GATA3_peak
## 5 chr1 8061475 8061476  32 GATA3_peak
## 6 chr1 8120791 8120792   4 GATA3_peak
nrow(df.all)
## [1] 15596

plot CDF

We are plotting the cumulative distribution function to evaluate whether a 3mer GAT are closer to the summit of GATA3 ChIP peaks (that might use 3mer GAT as a binding sequence) than they are to the summit of a random subset of union DHS regions (ctrl).

We can also determine the distance at which the CDFs’ difference between GATA3 peak set and ctrl peak set reach the plateaus (where CDF traces became parallel or converging).

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'ctrl'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'GATA3_peak'])
match.y = seq(0, 7800, by=1) # creating indices
rep.y = seq(0, 7800, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 15; 
## [1] 15
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

The above plot plots CDFs’ difference between GATA3 peak set and ctrl peak set. At 15bp, two CDFs start to be parallel/converged.

library(lattice)
library(latticeExtra)

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col =  c("#ce228e", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,3.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col =  c("#ce228e", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 1.176, lty =2)
             panel.ecdfplot(...)
         })

The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (gray, random subset from the union DHS regions).

The left-shift of the red curve suggests that the 3mer-GAT are closer to the GATA3 ChIP peak set then the ctrl peak set.

The vertical dashed line indicates at which distance (log10 of 15bp) the two CDF traces became parallel. This suggests that the 3mer-GAT often binds within 15 bp of the GATA3 peak summit.

ecdfplot(~dis, groups = status, data = df.all,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col =  c("#ce228e", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = '3mer-GAT Distance (bp) from peak summit',
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,100),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col =  c("#ce228e", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 15, lty =2)
             panel.ecdfplot(...)
         })

The difference between this CDF with the previous one is: The above one is plotting actual distance in bp, the previous one is plotting log10 of the distance. The two CDF plots are essentially communicating the same thing.

2.4 Second closest GAT

Expectation: see spike of the second GAT at fixed position.

Plan:

-Input1: all GAT coordinates .bed file
-Input2: closest GAT (to GATA3 peak summit -without GATA3-like motifs 123456789) coordinates .bed file
-Input3: closest GAT (to unionDHS summit) coordinates .bed file

First, use bedtools subtract to subtract input 1 and 2 (or in ctrl set: subtract input 1 and 3), the output will be GAT coordinates without the 1st closest GAT.
Then,
Use bedtools closestBed to find the second closest GAT to the 1st closest GAT use the output from bedtools subtract.
Plot PDF plot of distances between 1st and 2nd GAT.

2.4.0.1 all GAT coordinates .bed file

Input1
This is the 3mer GAT coordinates on hg38 genome (both plus and minus strand) generated with SeqOutbias use read size==30.

wc -l hg38.3.3.3.30_plus_minus_GAT.bed #64243047
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   13315   13318   14  14  -   GAT
#chr1   19797   19800   14  14  -   GAT

2.4.0.2 GATA3 peak subset (without the 9 GATA3-like motifs)

Input2
This all.distance.bed file is searching for closest 3mer GAT to the peak summit (top 20% intense peak without motif 123456789).
Use awk to select 3mer-GAT’s coordinates from the file.

wc -l all.distance.bed #7796
head -5 all.distance.bed
#chr1   5598187 5598188 chr1    5598164 5598167 36  36  +   GAT 21
#chr1   8017660 8017661 chr1    8017663 8017666 36  36  +   GAT 3
#chr1   8020178 8020179 chr1    8020170 8020173 14  14  -   GAT 6
#chr1   8055212 8055213 chr1    8055202 8055205 36  36  +   GAT 8
#chr1   8061475 8061476 chr1    8061441 8061444 36  36  +   GAT 32

awk '{print $4, $5, $6, $7, $8, $9, $10}' all.distance.bed | awk '{$1=$1}1' OFS="\t"> closest_1st_GAT_to_GATA_peak.bed
wc -l closest_1st_GAT_to_GATA_peak.bed #7796

# remove duplicated lines with `uniq`
awk '{print $4, $5, $6, $7, $8, $9, $10}' all.distance.bed | awk '{$1=$1}1' OFS="\t" | sort | uniq> closest_1st_GAT_to_GATA_uniq_peak.bed
wc -l closest_1st_GAT_to_GATA_uniq_peak.bed #7791
head -5 closest_1st_GAT_to_GATA_uniq_peak.bed
awk '$6 == "+" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_GATA_uniq_peak.bed
awk '$6 == "-" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_GATA_uniq_peak.bed
## chr10    100012503   100012506   14  14  -   GAT
## chr10    100043505   100043508   14  14  -   GAT
## chr10    100365033   100365036   14  14  -   GAT
## chr10    100370706   100370709   14  14  -   GAT
## chr10    101295816   101295819   14  14  -   GAT
## Positive strand count: 3870
## Positive strand count: 3921

Bedtools subtract

-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.

module load bedtools 
sort -k1,1 -k2,2n hg38.3.3.3.30_plus_minus_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.sorted.bed
head -5 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11521   11524   36  36  +   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   12188   12191   36  36  +   GAT

sort -k1,1 -k2,2n closest_1st_GAT_to_GATA_uniq_peak.bed > closest_1st_GAT_to_GATA_uniq_peak.sorted.bed 
head -5 closest_1st_GAT_to_GATA_uniq_peak.sorted.bed
#chr1   5598164 5598167 36  36  +   GAT
#chr1   8017663 8017666 36  36  +   GAT
#chr1   8020170 8020173 14  14  -   GAT
#chr1   8055202 8055205 36  36  +   GAT
#chr1   8061441 8061444 36  36  +   GAT

bedtools subtract -a hg38.3.3.3.30_plus_minus_GAT.sorted.bed -b closest_1st_GAT_to_GATA_uniq_peak.sorted.bed -f 1.00 -s > hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT.bed

wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_GATA_uniq_peak.sorted.bed #7791
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT.bed #64235256

#64243047-7791
#[1] 64235256

2.4.0.3 unionDHS set (without the 9 GATA3-like motifs)

Input3
This ctrl.distance.bed file is searching for closest 3mer GAT to the unionDHS peak summit (ramdom 7800 regions without motif 123456789).
Use awk to select 3mer-GAT’s coordinates from the file.

wc -l ctrl.distance.bed 
head -5 all.distance.bed

awk '{print $4, $5, $6, $7, $8, $9, $10}' ctrl.distance.bed | awk '{$1=$1}1' OFS="\t"> closest_1st_GAT_to_unionDHS_peak.bed
wc -l closest_1st_GAT_to_unionDHS_peak.bed 

awk '{print $4, $5, $6, $7, $8, $9, $10}' ctrl.distance.bed | awk '{$1=$1}1' OFS="\t" | sort | uniq> closest_1st_GAT_to_unionDHS_uniq_peak.bed
wc -l closest_1st_GAT_to_unionDHS_uniq_peak.bed 
##     7800 ctrl.distance.bed
## chr1 5598187 5598188 chr1    5598164 5598167 36  36  +   GAT 21
## chr1 8017660 8017661 chr1    8017663 8017666 36  36  +   GAT 3
## chr1 8020178 8020179 chr1    8020170 8020173 14  14  -   GAT 6
## chr1 8055212 8055213 chr1    8055202 8055205 36  36  +   GAT 8
## chr1 8061475 8061476 chr1    8061441 8061444 36  36  +   GAT 32
##     7800 closest_1st_GAT_to_unionDHS_peak.bed
##     7800 closest_1st_GAT_to_unionDHS_uniq_peak.bed
head -5 closest_1st_GAT_to_unionDHS_uniq_peak.bed
awk '$6 == "+" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_unionDHS_uniq_peak.bed
awk '$6 == "-" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_unionDHS_uniq_peak.bed
## chr1 10001686    10001689    36  36  +   GAT
## chr1 10039965    10039968    36  36  +   GAT
## chr1 100703247   100703250   14  14  -   GAT
## chr1 100733243   100733246   36  36  +   GAT
## chr1 101907495   101907498   36  36  +   GAT
## Positive strand count: 3894
## Positive strand count: 3906

bedtools subtract

-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.

module load bedtools 
sort -k1,1 -k2,2n hg38.3.3.3.30_plus_minus_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.sorted.bed
head -5 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11521   11524   36  36  +   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   12188   12191   36  36  +   GAT

sort -k1,1 -k2,2n closest_1st_GAT_to_unionDHS_uniq_peak.bed > closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed 
head -5 closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
#chr1   191000  191003  14  14  -   GAT
#chr1   1183537 1183540 14  14  -   GAT
#chr1   1207290 1207293 14  14  -   GAT
#chr1   1548964 1548967 36  36  +   GAT
#chr1   1666192 1666195 36  36  +   GAT

bedtools subtract -a hg38.3.3.3.30_plus_minus_GAT.sorted.bed -b closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed -f 1.00 -s > hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_unionDHS.bed

wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed #7800
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_unionDHS.bed #64235247

#64243047-7800
#[1] 64235247

2.4.0.4 Find the 2nd closest GAT to the 1st closest GAT with bedtools closestBed

GATA3 peak subset

Read the file (1st closest GAT to GATA3 peak subset summit) in.

closest_1st_GAT_to_GATA=read.table("closest_1st_GAT_to_GATA_uniq_peak.sorted.bed", header=FALSE)
nrow(closest_1st_GAT_to_GATA)
## [1] 7791
head(closest_1st_GAT_to_GATA)
##     V1      V2      V3 V4 V5 V6  V7
## 1 chr1 5598164 5598167 36 36  + GAT
## 2 chr1 8017663 8017666 36 36  + GAT
## 3 chr1 8020170 8020173 14 14  - GAT
## 4 chr1 8055202 8055205 36 36  + GAT
## 5 chr1 8061441 8061444 36 36  + GAT
## 6 chr1 8120785 8120788 14 14  - GAT

Read the GAT coorrdinates file without the 1st closest GAT to GATA3 peak summit (subset of top20% peaks without GATA3-like motifs 123456789).

all.GAT.without.1st.closest.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT.bed", sep="\t", header=FALSE)
nrow(all.GAT.without.1st.closest.file)
#[1] 64235256
head(all.GAT.without.1st.closest.file)
#    V1    V2    V3 V4 V5 V6  V7
#1 chr1 11145 11148 14 14  - GAT
#2 chr1 11160 11163 14 14  - GAT
#3 chr1 11521 11524 36 36  + GAT
#4 chr1 11576 11579 14 14  - GAT
#5 chr1 12188 12191 36 36  + GAT
#6 chr1 13274 13277 36 36  + GAT
tail(all.GAT.without.1st.closest.file)
#                             V1    V2    V3 V4 V5 V6  V7
#64235251 chrY_KI270740v1_random 33149 33152 14 14  - GAT
#64235252 chrY_KI270740v1_random 35533 35536 36 36  + GAT
#64235253 chrY_KI270740v1_random 35543 35546 36 36  + GAT
#64235254 chrY_KI270740v1_random 35912 35915 36 36  + GAT
#64235255 chrY_KI270740v1_random 36540 36543 36 36  + GAT
#64235256 chrY_KI270740v1_random 36913 36916 36 36  + GAT

Use bedtools closestBed to find the 2nd closest GAT to the 1st closest GAT use the output from bedtools subtract.

closest.2nd.distance=bedTools.closest(bed1 = closest_1st_GAT_to_GATA[,1:3], bed2 = all.GAT.without.1st.closest.file, opt.string = '-d -t first')
write.table(closest.2nd.distance,file= 'closest.2nd.distance.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231206_GATA3_ChIP_2nd_clsestbed.Rdata') 

UnionDHS regions

Read the file (1st closest GAT to unionHDS summit) in:

closest_1st_GAT_to_unionDHS=read.table("closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed", header=FALSE)
nrow(closest_1st_GAT_to_unionDHS)
## [1] 7800
head(closest_1st_GAT_to_unionDHS)
##     V1      V2      V3 V4 V5 V6  V7
## 1 chr1  191000  191003 14 14  - GAT
## 2 chr1 1183537 1183540 14 14  - GAT
## 3 chr1 1207290 1207293 14 14  - GAT
## 4 chr1 1548964 1548967 36 36  + GAT
## 5 chr1 1666192 1666195 36 36  + GAT
## 6 chr1 2299000 2299003 36 36  + GAT

Read the GAT coorrdinates file without the 1st closest GAT.

all.GAT.without.1st.closest.ctrl.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_unionDHS.bed", sep="\t", header=FALSE)
nrow(all.GAT.without.1st.closest.ctrl.file) 
#[1] 64235247
head(all.GAT.without.1st.closest.ctrl.file)
#    V1    V2    V3 V4 V5 V6  V7
#1 chr1 11145 11148 14 14  - GAT
#2 chr1 11160 11163 14 14  - GAT
#3 chr1 11521 11524 36 36  + GAT
#4 chr1 11576 11579 14 14  - GAT
#5 chr1 12188 12191 36 36  + GAT
#6 chr1 13274 13277 36 36  + GAT

Use bedtools closestBed to find the 2nd closest GAT to the 1st closest GAT use the output from bedtools subtract.

closest.2nd.distance.unionDHS=bedTools.closest(bed1 = closest_1st_GAT_to_unionDHS[,1:3], bed2 = all.GAT.without.1st.closest.ctrl.file, opt.string = '-d -t first')
write.table(closest.2nd.distance.unionDHS, file= 'closest.2nd.distance.unionDHS.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231206_GATA3_ChIP_2nd_clsestbed2.Rdata') 

2.4.0.5 Probability Density Function Plot

In the previous section, we have found the 2nd closest GAT to the 1st closest GAT to either GATA3peak subset summits or unionDHS summits.

#GATA3 peak subset
closest.2nd.distance=read.table('closest.2nd.distance.bed',header=F, comment.char='')
nrow(closest.2nd.distance) #7791
## [1] 7791
nrow(closest.2nd.distance[closest.2nd.distance$V9=="+",]) #3839
## [1] 3839
nrow(closest.2nd.distance[closest.2nd.distance$V9=="-",]) #3952
## [1] 3952
head(closest.2nd.distance)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598164 5598167 chr1 5598146 5598149 36 36  + GAT  16
## 2 chr1 8017663 8017666 chr1 8017634 8017637 14 14  - GAT  27
## 3 chr1 8020170 8020173 chr1 8020158 8020161 14 14  - GAT  10
## 4 chr1 8055202 8055205 chr1 8055193 8055196 14 14  - GAT   7
## 5 chr1 8061441 8061444 chr1 8061437 8061440 36 36  + GAT   2
## 6 chr1 8120785 8120788 chr1 8120752 8120755 36 36  + GAT  31
#unionDHS
closest.2nd.distance.unionDHS=read.table('closest.2nd.distance.unionDHS.bed',header=F, comment.char='')
nrow(closest.2nd.distance.unionDHS) # 7800
## [1] 7800
nrow(closest.2nd.distance.unionDHS[closest.2nd.distance.unionDHS$V9=="+",]) #3858
## [1] 3858
nrow(closest.2nd.distance.unionDHS[closest.2nd.distance.unionDHS$V9=="-",]) #3942
## [1] 3942
head(closest.2nd.distance.unionDHS)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1  191000  191003 chr1  191005  191008 14 14  - GAT   3
## 2 chr1 1183537 1183540 chr1 1183725 1183728 14 14  - GAT 186
## 3 chr1 1207290 1207293 chr1 1207289 1207292 36 36  + GAT   0
## 4 chr1 1548964 1548967 chr1 1549048 1549051 36 36  + GAT  82
## 5 chr1 1666192 1666195 chr1 1666232 1666235 36 36  + GAT  38
## 6 chr1 2299000 2299003 chr1 2299054 2299057 36 36  + GAT  52

Subset two data.frames to contain information of the 2nd closest 3mer-GAT info (chr-start-end-strand-dis);
Then add the ‘status’ info and rbind the two dataframe.

df.chip = cbind(closest.2nd.distance[,c(1:3, 9, 11)], "GATA3_peak")
df.ctrl = cbind(closest.2nd.distance.unionDHS[,c(1:3, 9, 11)], "ctrl")
colnames(df.chip) = c(colnames(closest.2nd.distance)[1:3], "strand", "dis", "status")
colnames(df.ctrl) = c(colnames(closest.2nd.distance.unionDHS)[1:3], "strand", "dis", "status")
df.all = rbind(df.chip, df.ctrl)
head(df.all)
##     V1      V2      V3 strand dis     status
## 1 chr1 5598164 5598167      +  16 GATA3_peak
## 2 chr1 8017663 8017666      -  27 GATA3_peak
## 3 chr1 8020170 8020173      -  10 GATA3_peak
## 4 chr1 8055202 8055205      -   7 GATA3_peak
## 5 chr1 8061441 8061444      +   2 GATA3_peak
## 6 chr1 8120785 8120788      +  31 GATA3_peak
nrow(df.all)
## [1] 15591
df.all$dis=as.numeric(df.all$dis)
df.all$strand=as.factor(df.all$strand)
df.all$status=as.factor(df.all$status)
head(df.all)
##     V1      V2      V3 strand dis     status
## 1 chr1 5598164 5598167      +  16 GATA3_peak
## 2 chr1 8017663 8017666      -  27 GATA3_peak
## 3 chr1 8020170 8020173      -  10 GATA3_peak
## 4 chr1 8055202 8055205      -   7 GATA3_peak
## 5 chr1 8061441 8061444      +   2 GATA3_peak
## 6 chr1 8120785 8120788      +  31 GATA3_peak
tail(df.all)
##         V1       V2       V3 strand dis status
## 15586 chrY 19024627 19024630      +   3   ctrl
## 15587 chrY 19319677 19319680      -  22   ctrl
## 15588 chrY 21220963 21220966      -  12   ctrl
## 15589 chrY 21370822 21370825      -  82   ctrl
## 15590 chrY 24575686 24575689      -  11   ctrl
## 15591 chrY 26323055 26323058      +  10   ctrl

Plot PDF:

# interaction variable
df.all$interaction_var <- interaction(df.all$strand, df.all$status)
library(lattice)
library(latticeExtra)

myColors <- ifelse(df.all$strand == "+", "red", "blue")
densityplot( ~ abs(dis) | status,
            groups = strand,
            key = list(text = list(as.character(unique(df.all$strand))), 
                       space="top",
                       lines=list(col=c("red","blue")),
                       columns=nlevels(df.all$strand)),
            data = df.all,
            xlim=c(0,40),
            from=0,
            to=40,
            xlab = "distance (bp) from 2nd GAT to 1st GAT",
            type = "count",
            col = myColors,
            lty = 1,
            main = "ctrl vs. GATA3 peaks",
            par.settings=list(par.xlab.text=list(cex=1.1,font=2), 
                                 par.ylab.text=list(cex=1.1,font=2))
)

In this plot, we use probability density function to plot the relative likelihood at different distances of observing a second closest 3mer GAT to the 1st closest GAT.

The left panel is plotting within the control sets and the right panel is plotting the GATA3 peak subset. In each panel, it shows the distances of 2nd GAT on minus strand (blue) or the distances of the 2nd GAT on plus strand (red) to the 1st closest GAT.

In GATA3 peak subset, most 2nd closest GAT are found at ~9bp distance to the 1st closest GAT. On the other hand, in the control set, we did not see such spike at any fixed position.

This suggest that in GATA ChIP peaks, 3mer GAT-GAT are often found to have fixed spacings between them.

2.5 peaks in other quantiles

In the above analysis we analyzed peak set that are 1) without MEME/STREME found motifs; and 2) in top 20% peak intensity quantile.

In this analysis, we want to include the other 4 lower quantile peaks that without MEME/STREME found motifs.

load files

load files contains selected ChIP peak summit info

chip.peak.summit.quantile1=read.table("quantile1_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile1)
## [1] 7796
head(chip.peak.summit.quantile1)
##     V1      V2      V3        V4
## 1 chr1 5598187 5598188 271.93122
## 2 chr1 8017660 8017661 676.31362
## 3 chr1 8020178 8020179 324.95549
## 4 chr1 8055212 8055213 111.55874
## 5 chr1 8061475 8061476  98.51927
## 6 chr1 8120791 8120792 124.49564
chip.peak.summit.quantile0.8=read.table("quantile0.8_summits.bed", header=FALSE)
chip.peak.summit.quantile0.6=read.table("quantile0.6_summits.bed", header=FALSE)
chip.peak.summit.quantile0.4=read.table("quantile0.4_summits.bed", header=FALSE)
chip.peak.summit.quantile0.2=read.table("quantile0.2_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile0.8)
## [1] 7796
head(chip.peak.summit.quantile0.8)
##     V1      V2      V3       V4
## 1 chr1 1073822 1073823 56.70255
## 2 chr1 5637335 5637336 63.60992
## 3 chr1 7019683 7019684 72.40249
## 4 chr1 7266189 7266190 70.66320
## 5 chr1 7494212 7494213 75.17251
## 6 chr1 8026313 8026314 69.95015
summary(chip.peak.summit.quantile0.8$V4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   52.89   57.45   63.27   64.85   71.43   83.30
nrow(chip.peak.summit.quantile0.6)
## [1] 7796
head(chip.peak.summit.quantile0.6)
##     V1      V2      V3       V4
## 1 chr1  827380  827381 45.32469
## 2 chr1 1074184 1074185 48.18883
## 3 chr1 1124966 1124967 45.46610
## 4 chr1 1157747 1157748 44.19288
## 5 chr1 3587455 3587456 51.91936
## 6 chr1 6199528 6199529 42.31217
summary(chip.peak.summit.quantile0.6$V4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.60   43.11   46.04   46.22   49.20   52.89
nrow(chip.peak.summit.quantile0.4)
## [1] 7796
head(chip.peak.summit.quantile0.4)
##     V1      V2      V3       V4
## 1 chr1  916769  916770 36.69993
## 2 chr1 1013265 1013266 34.70221
## 3 chr1 1013580 1013581 36.26922
## 4 chr1 1021142 1021143 32.69951
## 5 chr1 1158192 1158193 36.21743
## 6 chr1 1231880 1231881 35.47152
summary(chip.peak.summit.quantile0.4$V4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.95   34.14   36.16   36.19   38.22   40.59
nrow(chip.peak.summit.quantile0.2)
## [1] 7594
head(chip.peak.summit.quantile0.2)
##     V1      V2      V3       V4
## 1 chr1  924853  924854 30.44684
## 2 chr1  966653  966654 26.87016
## 3 chr1  999508  999509 21.22455
## 4 chr1 1000536 1000537 24.34263
## 5 chr1 1001891 1001892 25.52872
## 6 chr1 1020187 1020188 20.63151
summary(chip.peak.summit.quantile0.2$V4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6095 24.3426 27.5137 26.4411 29.9572 31.9466

load files contains 3mer coordinates info

#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
#    V1    V2    V3 V4 V5 V6  V7
#1 chr1 11145 11148 14 14  - GAT
#2 chr1 11160 11163 14 14  - GAT
#3 chr1 11576 11579 14 14  - GAT
#4 chr1 13315 13318 14 14  - GAT
#5 chr1 19797 19800 14 14  - GAT
#6 chr1 27023 27026 14 14  - GAT
tail(all.GAT.file)
#                             V1    V2    V3 V4 V5 V6  V7
#64243042 chrY_KI270740v1_random 33028 33031 36 36  + GAT
#64243043 chrY_KI270740v1_random 35533 35536 36 36  + GAT
#64243044 chrY_KI270740v1_random 35543 35546 36 36  + GAT
#64243045 chrY_KI270740v1_random 35912 35915 36 36  + GAT
#64243046 chrY_KI270740v1_random 36540 36543 36 36  + GAT
#64243047 chrY_KI270740v1_random 36913 36916 36 36  + GAT

nrow(all.GAT.file)
#[1] 64243047

bedtools closestBed
The below function will sort input bed1 and bed2 first, then run bedtools closestBed between bed1 and bed2.

# define function 
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
  
  options(scipen =99) # not use scientific notation when writing out
  
  #write bed formatted data.frames to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  # the command sort a and b file by coordinates
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  # the command call closestBed on bed1 and bed2
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='')
  
  # remove intermediate files
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
  return(res)
}

Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.

Find the closest GAT to peak summit regardless of GAT strandedness

all.distance.quantile1=bedTools.closest(bed1 = chip.peak.summit.quantile1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
all.distance.quantile0.8=bedTools.closest(bed1 = chip.peak.summit.quantile0.8[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
all.distance.quantile0.6=bedTools.closest(bed1 = chip.peak.summit.quantile0.6[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
all.distance.quantile0.4=bedTools.closest(bed1 = chip.peak.summit.quantile0.4[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
all.distance.quantile0.2=bedTools.closest(bed1 = chip.peak.summit.quantile0.2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')


write.table(all.distance.quantile1,file= 'all.distance.quantile1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(all.distance.quantile0.8,file= 'all.distance.quantile0.8.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(all.distance.quantile0.6,file= 'all.distance.quantile0.6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(all.distance.quantile0.4,file= 'all.distance.quantile0.4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(all.distance.quantile0.2,file= 'all.distance.quantile0.2.bed', quote=F,sep="\t",col.names=F,row.names=F)


save.image('231211_GATA3_ChIP_clsestbed.Rdata') 

In these “all.distance.quantileX’ data frames, V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.

all.distance.quantile1=read.table('all.distance.quantile1.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
all.distance.quantile0.8=read.table('all.distance.quantile0.8.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
all.distance.quantile0.6=read.table('all.distance.quantile0.6.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
all.distance.quantile0.4=read.table('all.distance.quantile0.4.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
all.distance.quantile0.2=read.table('all.distance.quantile0.2.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4

coherence check 1: the number of the closest minus GAT is comparable to the number of the closest plus GAT.

nrow(all.distance.quantile0.8)
## [1] 7796
nrow(all.distance.quantile0.8[all.distance.quantile0.8$V9=="+",])
## [1] 3907
nrow(all.distance.quantile0.8[all.distance.quantile0.8$V9=="-",])
## [1] 3889
nrow(all.distance.quantile0.6)
## [1] 7796
nrow(all.distance.quantile0.6[all.distance.quantile0.6$V9=="+",])
## [1] 3982
nrow(all.distance.quantile0.6[all.distance.quantile0.6$V9=="-",])
## [1] 3814
nrow(all.distance.quantile0.4)
## [1] 7796
nrow(all.distance.quantile0.4[all.distance.quantile0.4$V9=="+",])
## [1] 3983
nrow(all.distance.quantile0.4[all.distance.quantile0.4$V9=="-",])
## [1] 3813
nrow(all.distance.quantile0.2)
## [1] 7594
nrow(all.distance.quantile0.2[all.distance.quantile0.2$V9=="+",])
## [1] 3750
nrow(all.distance.quantile0.2[all.distance.quantile0.2$V9=="-",])
## [1] 3844

coherence check 2: Since we have use -t first to take care of the tie, now we have equal number of unique peak with their closest plus or minus or both GAT.

nrow(chip.peak.summit.quantile0.8) # there are 7796 unique peak 
## [1] 7796
nrow(all.distance.quantile0.8)  
## [1] 7796
nrow(chip.peak.summit.quantile0.6) # there are 7796 unique peak 
## [1] 7796
nrow(all.distance.quantile0.6)  
## [1] 7796
nrow(chip.peak.summit.quantile0.4) # there are 7796 unique peak 
## [1] 7796
nrow(all.distance.quantile0.4)  
## [1] 7796
nrow(chip.peak.summit.quantile0.2) # there are 7594 unique peak 
## [1] 7594
nrow(all.distance.quantile0.2)  
## [1] 7594

2.5.1 Cumulative distribution function (CDF) plot

df.chip = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.chip) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./all.distance.quantile*.bed"))) {
    print(chip.peak)
    quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'all.distance.')[[1]][2]), ".bed")[[1]][1]
    print(quantile.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], quantile.name)
    df.chip = rbind(df.chip, all.distance)
}
## [1] "./all.distance.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./all.distance.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./all.distance.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./all.distance.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./all.distance.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame':    38778 obs. of  5 variables:
##  $ V1           : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2           : int  924853 966653 999508 1000536 1001891 1020187 1069549 1079599 1163013 1163246 ...
##  $ V3           : int  924854 966654 999509 1000537 1001892 1020188 1069550 1079600 1163014 1163247 ...
##  $ V11          : int  0 65 61 91 46 143 37 29 213 120 ...
##  $ quantile.name: chr  "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"

combine all single quantile plot into one plot

colnames(df.chip) = c("V1","V2","V3", "dis", "status")
ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
df.ctrl = cbind(ctrl.distance[,c(1:3, 11)], "ctrl")
colnames(df.ctrl) = c(colnames(ctrl.distance)[1:3], "dis", "status")
head(df.chip)
##     V1      V2      V3 dis      status
## 1 chr1  924853  924854   0 quantile0.2
## 2 chr1  966653  966654  65 quantile0.2
## 3 chr1  999508  999509  61 quantile0.2
## 4 chr1 1000536 1000537  91 quantile0.2
## 5 chr1 1001891 1001892  46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
head(df.ctrl)
##     V1      V2      V3 dis status
## 1 chr1  190928  190929  72   ctrl
## 2 chr1 1183557 1183558  18   ctrl
## 3 chr1 1207307 1207308  15   ctrl
## 4 chr1 1548961 1548962   3   ctrl
## 5 chr1 1666170 1666171  22   ctrl
## 6 chr1 2298957 2298958  43   ctrl
df.all=rbind(df.chip, df.ctrl)
df.all$status = factor(df.all$status, levels = c("quantile0.2", "quantile0.4", "quantile0.6", "quantile0.8", "quantile1", "ctrl"))
head(df.all)
##     V1      V2      V3 dis      status
## 1 chr1  924853  924854   0 quantile0.2
## 2 chr1  966653  966654  65 quantile0.2
## 3 chr1  999508  999509  61 quantile0.2
## 4 chr1 1000536 1000537  91 quantile0.2
## 5 chr1 1001891 1001892  46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
nrow(df.all)
## [1] 46578

plot CDF

We are plotting the cumulative distribution function to evaluate whether a 3mer GAT are closer to the summit of GATA3 ChIP peaks (that might use 3mer GAT as a binding sequence) than they are to the summit of a random subset of union DHS regions (ctrl).

library(lattice)
library(latticeExtra)

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,3.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e","grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 1.176, lty =2)
             panel.ecdfplot(...)
         })

The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (gray, random subset from the union DHS regions).

The left-shift of the red curve suggests that the 3mer-GAT are closer to the GATA3 ChIP peak set then the ctrl peak set.

The vertical dashed line indicates at which distance (log10 of 15bp) the two CDF traces became parallel. This suggests that the 3mer-GAT often binds within 15 bp of the GATA3 peak summit.

ecdfplot(~dis, groups = status, data = df.all,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("pink","red"))(4), "#ce228e","grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = '3mer-GAT Distance (bp) from peak summit',
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,100),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e","grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 15, lty =2)
             panel.ecdfplot(...)
         })

The difference between this CDF with the previous one is: The above one is plotting actual distance in bp, the previous one is plotting log10 of the distance. The two CDF plots are essentially communicating the same thing.

3 for each quantile, at which point they converge?

3.0.1 Probability Density Function Plot

3.0.1.1 all GAT coordinates .bed file

Input1
This is the 3mer GAT coordinates on hg38 genome (both plus and minus strand) generated with SeqOutbias use read size==30.

wc -l hg38.3.3.3.30_plus_minus_GAT.bed #64243047
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   13315   13318   14  14  -   GAT
#chr1   19797   19800   14  14  -   GAT

3.0.1.2 GATA3 peak subset (without the 9 GATA3-like motifs)

Input2
This all.distance.quantileX.bed file is searching for closest 3mer GAT to the peak summit (quantile1: top 20% intense peak without motif 123456789; quantile 0.8: second 20%……).
Use awk to select 3mer-GAT’s coordinates from the file.

for i in all.distance.quantile*.bed
do
 nm=$(echo $i | awk -F"all.distance." '{print $2}' | awk -F".bed" '{print $1}')
 echo $nm
 wc -l $i
 head -5 $i
 awk '{print $4, $5, $6, $7, $8, $9, $10}' $i | awk '{$1=$1}1' OFS="\t" > closest_1st_GAT_to_GATA_peak_${nm}.bed
 wc -l closest_1st_GAT_to_GATA_peak_${nm}.bed
 awk '{print $4, $5, $6, $7, $8, $9, $10}' $i | awk '{$1=$1}1' OFS="\t" | sort | uniq > closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
 wc -l closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
 head -5 closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
 awk '$6 == "+" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
 awk '$6 == "-" {count++} END {print "Negative strand count:", count}' closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
done 2>&1 | tee -a closest_1st_GAT_to_GATA_sort_uniq_log.txt
cat closest_1st_GAT_to_GATA_sort_uniq_log.txt
## quantile0.2
##     7594 all.distance.quantile0.2.bed
## chr1 924853  924854  chr1    924851  924854  14  14  -   GAT 0
## chr1 966653  966654  chr1    966718  966721  36  36  +   GAT 65
## chr1 999508  999509  chr1    999569  999572  14  14  -   GAT 61
## chr1 1000536 1000537 chr1    1000627 1000630 36  36  +   GAT 91
## chr1 1001891 1001892 chr1    1001937 1001940 36  36  +   GAT 46
##     7594 closest_1st_GAT_to_GATA_peak_quantile0.2.bed
##     7563 closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.bed
## chr1 1000627 1000630 36  36  +   GAT
## chr1 1001937 1001940 36  36  +   GAT
## chr1 100351435   100351438   36  36  +   GAT
## chr1 10035675    10035678    36  36  +   GAT
## chr1 101025693   101025696   14  14  -   GAT
## Positive strand count: 3743
## Negative strand count: 3820
## quantile0.4
##     7796 all.distance.quantile0.4.bed
## chr1 916769  916770  chr1    916719  916722  36  36  +   GAT 48
## chr1 1013265 1013266 chr1    1013247 1013250 36  36  +   GAT 16
## chr1 1013580 1013581 chr1    1013584 1013587 36  36  +   GAT 4
## chr1 1021142 1021143 chr1    1021115 1021118 14  14  -   GAT 25
## chr1 1158192 1158193 chr1    1158274 1158277 14  14  -   GAT 82
##     7796 closest_1st_GAT_to_GATA_peak_quantile0.4.bed
##     7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.bed
## chr1 100249748   100249751   36  36  +   GAT
## chr1 101004290   101004293   14  14  -   GAT
## chr1 1013247 1013250 36  36  +   GAT
## chr1 1013584 1013587 36  36  +   GAT
## chr1 10210228    10210231    36  36  +   GAT
## Positive strand count: 3979
## Negative strand count: 3802
## quantile0.6
##     7796 all.distance.quantile0.6.bed
## chr1 827380  827381  chr1    827390  827393  14  14  -   GAT 10
## chr1 1074184 1074185 chr1    1074157 1074160 14  14  -   GAT 25
## chr1 1124966 1124967 chr1    1125026 1125029 36  36  +   GAT 60
## chr1 1157747 1157748 chr1    1157663 1157666 14  14  -   GAT 82
## chr1 3587455 3587456 chr1    3587417 3587420 14  14  -   GAT 36
##     7796 closest_1st_GAT_to_GATA_peak_quantile0.6.bed
##     7785 closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.bed
## chr1 100038217   100038220   14  14  -   GAT
## chr1 10033076    10033079    14  14  -   GAT
## chr1 101371391   101371394   36  36  +   GAT
## chr1 102772386   102772389   36  36  +   GAT
## chr1 10599290    10599293    36  36  +   GAT
## Positive strand count: 3980
## Negative strand count: 3805
## quantile0.8
##     7796 all.distance.quantile0.8.bed
## chr1 1073822 1073823 chr1    1073808 1073811 36  36  +   GAT 12
## chr1 5637335 5637336 chr1    5637333 5637336 36  36  +   GAT 0
## chr1 7019683 7019684 chr1    7019698 7019701 14  14  -   GAT 15
## chr1 7266189 7266190 chr1    7266144 7266147 36  36  +   GAT 43
## chr1 7494212 7494213 chr1    7494203 7494206 14  14  -   GAT 7
##     7796 closest_1st_GAT_to_GATA_peak_quantile0.8.bed
##     7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.bed
## chr1 10514718    10514721    36  36  +   GAT
## chr1 1073808 1073811 36  36  +   GAT
## chr1 107683212   107683215   36  36  +   GAT
## chr1 107688147   107688150   36  36  +   GAT
## chr1 107688516   107688519   36  36  +   GAT
## Positive strand count: 3897
## Negative strand count: 3884
## quantile1
##     7796 all.distance.quantile1.bed
## chr1 5598187 5598188 chr1    5598164 5598167 36  36  +   GAT 21
## chr1 8017660 8017661 chr1    8017663 8017666 36  36  +   GAT 3
## chr1 8020178 8020179 chr1    8020170 8020173 14  14  -   GAT 6
## chr1 8055212 8055213 chr1    8055202 8055205 36  36  +   GAT 8
## chr1 8061475 8061476 chr1    8061441 8061444 36  36  +   GAT 32
##     7796 closest_1st_GAT_to_GATA_peak_quantile1.bed
##     7791 closest_1st_GAT_to_GATA_uniq_peak_quantile1.bed
## chr1 100486183   100486186   14  14  -   GAT
## chr1 10083257    10083260    36  36  +   GAT
## chr1 101797999   101798002   14  14  -   GAT
## chr1 102941741   102941744   14  14  -   GAT
## chr1 10511140    10511143    36  36  +   GAT
## Positive strand count: 3870
## Negative strand count: 3921

Bedtools subtract

-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.

module load bedtools 
#sort -k1,1 -k2,2n hg38.3.3.3.30_plus_minus_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#head -5 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11521   11524   36  36  +   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   12188   12191   36  36  +   GAT


for i in closest_1st_GAT_to_GATA_uniq_peak_quantile*.bed
do
  nm=$(echo $i | awk -F"closest_1st_GAT_to_GATA_uniq_peak_" '{print $2}' | awk -F".bed" '{print $1}')
  echo $nm
  sort -k1,1 -k2,2n $i > closest_1st_GAT_to_GATA_uniq_peak_${nm}.sorted.bed 
  head -5 closest_1st_GAT_to_GATA_uniq_peak_${nm}.sorted.bed 
  bedtools subtract -a hg38.3.3.3.30_plus_minus_GAT.sorted.bed -b closest_1st_GAT_to_GATA_uniq_peak_${nm}.sorted.bed -f 1.00 -s > hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_${nm}.bed
  wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
  wc -l closest_1st_GAT_to_GATA_uniq_peak_${nm}.sorted.bed 
  wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_${nm}.bed 
done 2>&1 | tee -a closest_1st_GAT_to_GATA_Bedsubtract_log.txt
cat closest_1st_GAT_to_GATA_Bedsubtract_log.txt
## quantile0.2
## chr1 924851  924854  14  14  -   GAT
## chr1 966718  966721  36  36  +   GAT
## chr1 999569  999572  14  14  -   GAT
## chr1 1000627 1000630 36  36  +   GAT
## chr1 1001937 1001940 36  36  +   GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7563 closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.sorted.bed
## 64235484 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile0.2.bed
## quantile0.4
## chr1 916719  916722  36  36  +   GAT
## chr1 1013247 1013250 36  36  +   GAT
## chr1 1013584 1013587 36  36  +   GAT
## chr1 1021115 1021118 14  14  -   GAT
## chr1 1158274 1158277 14  14  -   GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.sorted.bed
## 64235266 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile0.4.bed
## quantile0.6
## chr1 827390  827393  14  14  -   GAT
## chr1 1074157 1074160 14  14  -   GAT
## chr1 1125026 1125029 36  36  +   GAT
## chr1 1157663 1157666 14  14  -   GAT
## chr1 3587417 3587420 14  14  -   GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7785 closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.sorted.bed
## 64235262 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile0.6.bed
## quantile0.8
## chr1 1073808 1073811 36  36  +   GAT
## chr1 5637333 5637336 36  36  +   GAT
## chr1 7019698 7019701 14  14  -   GAT
## chr1 7266144 7266147 36  36  +   GAT
## chr1 7494203 7494206 14  14  -   GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.sorted.bed
## 64235266 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile0.8.bed
## quantile1
## chr1 5598164 5598167 36  36  +   GAT
## chr1 8017663 8017666 36  36  +   GAT
## chr1 8020170 8020173 14  14  -   GAT
## chr1 8055202 8055205 36  36  +   GAT
## chr1 8061441 8061444 36  36  +   GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7791 closest_1st_GAT_to_GATA_uniq_peak_quantile1.sorted.bed
## 64235256 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile1.bed

3.0.1.3 unionDHS set (without the 9 GATA3-like motifs)

Input3

bedtools subtract

wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed #7800
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_unionDHS.bed #64235247

#64243047-7800
#[1] 64235247

3.0.1.4 Find the 2nd closest GAT to the 1st closest GAT with bedtools closestBed

GATA3 peak subset

Read the file (1st closest GAT to GATA3 peak subset summit) in.
Read the GAT coorrdinates file without the 1st closest GAT to GATA3 peak summit (5 quantile subsets that ranked by intensity of peaks without GATA3-like motifs 123456789).
Use bedtools closestBed to find the 2nd closest GAT to the 1st closest GAT use the output from bedtools subtract.

#load the bedTools.closest function first
for (closest_1st_GAT in Sys.glob(file.path("./closest_1st_GAT_to_GATA_uniq_peak_quantile*sorted.bed"))) {
    print(closest_1st_GAT)
    quantile.name =strsplit((strsplit(strsplit(closest_1st_GAT, "/")[[1]][length(strsplit(closest_1st_GAT, "/")[[1]])], 'closest_1st_GAT_to_GATA_uniq_peak_')[[1]][2]), ".sorted.bed")[[1]][1]
    print(quantile.name)
    
    closest_1st_GAT_to_GATA=read.table(closest_1st_GAT, header=FALSE)
    all.GAT.without.1st.closest.file=read.table(file = paste0("hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_", quantile.name, '.bed'), sep="\t", header=FALSE)
    
    closest.2nd.distance=bedTools.closest(bed1 = closest_1st_GAT_to_GATA[,1:3], bed2 = all.GAT.without.1st.closest.file, opt.string = '-d -t first')
    write.table(closest.2nd.distance,file=paste0("closest.2nd.distance.", quantile.name, '.bed'), quote=F,sep="\t",col.names=F,row.names=F)
}
save.image('231211_GATA3_ChIP_2nd_clsestbed.Rdata') 

UnionDHS regions

This part has already completed in previous analysis, file is saved in ‘closest.2nd.distance.unionDHS.bed’.

3.0.1.5 Probability Density Function Plot

In the previous section, we have found the 2nd closest GAT to the 1st closest GAT to either GATA3peak subset summits or unionDHS summits.

for i in closest.2nd.distance.quantile*.bed
do
 echo $i
 wc -l $i
 head -5 $i
 awk '$9 == "+" {count++} END {print "Positive strand count:", count}' $i
 awk '$9 == "-" {count++} END {print "Negative strand count:", count}' $i
done 2>&1 | tee -a closest_2nd_distance_log.txt
cat closest_2nd_distance_log.txt
## closest.2nd.distance.quantile0.2.bed
## 7563 closest.2nd.distance.quantile0.2.bed
## chr1 924851  924854  chr1    924776  924779  14  14  -   GAT 73
## chr1 966718  966721  chr1    966791  966794  14  14  -   GAT 71
## chr1 999569  999572  chr1    999577  999580  36  36  +   GAT 6
## chr1 1000627 1000630 chr1    1000628 1000631 14  14  -   GAT 0
## chr1 1001937 1001940 chr1    1002046 1002049 14  14  -   GAT 107
## Positive strand count: 3743
## Negative strand count: 3820
## closest.2nd.distance.quantile0.4.bed
## 7781 closest.2nd.distance.quantile0.4.bed
## chr1 916719  916722  chr1    916704  916707  36  36  +   GAT 13
## chr1 1013247 1013250 chr1    1013234 1013237 14  14  -   GAT 11
## chr1 1013584 1013587 chr1    1013519 1013522 14  14  -   GAT 63
## chr1 1021115 1021118 chr1    1021098 1021101 14  14  -   GAT 15
## chr1 1158274 1158277 chr1    1158525 1158528 36  36  +   GAT 249
## Positive strand count: 3917
## Negative strand count: 3864
## closest.2nd.distance.quantile0.6.bed
## 7785 closest.2nd.distance.quantile0.6.bed
## chr1 827390  827393  chr1    827303  827306  14  14  -   GAT 85
## chr1 1074157 1074160 chr1    1073954 1073957 36  36  +   GAT 201
## chr1 1125026 1125029 chr1    1125064 1125067 36  36  +   GAT 36
## chr1 1157663 1157666 chr1    1157653 1157656 14  14  -   GAT 8
## chr1 3587417 3587420 chr1    3587362 3587365 36  36  +   GAT 53
## Positive strand count: 3833
## Negative strand count: 3952
## closest.2nd.distance.quantile0.8.bed
## 7781 closest.2nd.distance.quantile0.8.bed
## chr1 1073808 1073811 chr1    1073954 1073957 36  36  +   GAT 144
## chr1 5637333 5637336 chr1    5637324 5637327 36  36  +   GAT 7
## chr1 7019698 7019701 chr1    7019703 7019706 36  36  +   GAT 3
## chr1 7266144 7266147 chr1    7266136 7266139 14  14  -   GAT 6
## chr1 7494203 7494206 chr1    7494226 7494229 36  36  +   GAT 21
## Positive strand count: 3955
## Negative strand count: 3826
## closest.2nd.distance.quantile1.bed
## 7791 closest.2nd.distance.quantile1.bed
## chr1 5598164 5598167 chr1    5598146 5598149 36  36  +   GAT 16
## chr1 8017663 8017666 chr1    8017634 8017637 14  14  -   GAT 27
## chr1 8020170 8020173 chr1    8020158 8020161 14  14  -   GAT 10
## chr1 8055202 8055205 chr1    8055193 8055196 14  14  -   GAT 7
## chr1 8061441 8061444 chr1    8061437 8061440 36  36  +   GAT 2
## Positive strand count: 3839
## Negative strand count: 3952

Subset two data.frames to contain information of the 2nd closest 3mer-GAT info (chr-start-end-strand-dis);
Then add the ‘status’ info and rbind the two dataframe.

df.chip = data.frame(matrix(nrow = 0, ncol = 6))     
for (chip.peak in Sys.glob(file.path("./closest.2nd.distance.quantile*.bed"))) {
    print(chip.peak)
    quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.2nd.distance.')[[1]][2]), ".bed")[[1]][1]
    print(quantile.name)
    
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(4:6, 9, 11)], quantile.name)
    df.chip = rbind(df.chip, all.distance)
}
## [1] "./closest.2nd.distance.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./closest.2nd.distance.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./closest.2nd.distance.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./closest.2nd.distance.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./closest.2nd.distance.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame':    38701 obs. of  6 variables:
##  $ V4           : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V5           : int  924776 966791 999577 1000628 1002046 1020331 1069481 1079763 1162642 1163472 ...
##  $ V6           : int  924779 966794 999580 1000631 1002049 1020334 1069484 1079766 1162645 1163475 ...
##  $ V9           : chr  "-" "-" "+" "-" ...
##  $ V11          : int  73 71 6 0 107 0 27 193 154 104 ...
##  $ quantile.name: chr  "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"

combine all single quantile plot into one plot

colnames(df.chip) = c("V1","V2","v3", "strand", "dis", "status")
closest.2nd.distance.unionDHS=read.table('closest.2nd.distance.unionDHS.bed',header=F, comment.char='')
df.ctrl = cbind(closest.2nd.distance.unionDHS[,c(4:6, 9, 11)], "ctrl")
colnames(df.ctrl) = c("V1","V2","v3", "strand", "dis", "status")
head(df.chip)
##     V1      V2      v3 strand dis      status
## 1 chr1  924776  924779      -  73 quantile0.2
## 2 chr1  966791  966794      -  71 quantile0.2
## 3 chr1  999577  999580      +   6 quantile0.2
## 4 chr1 1000628 1000631      -   0 quantile0.2
## 5 chr1 1002046 1002049      - 107 quantile0.2
## 6 chr1 1020331 1020334      -   0 quantile0.2
head(df.ctrl)
##     V1      V2      v3 strand dis status
## 1 chr1  191005  191008      -   3   ctrl
## 2 chr1 1183725 1183728      - 186   ctrl
## 3 chr1 1207289 1207292      +   0   ctrl
## 4 chr1 1549048 1549051      +  82   ctrl
## 5 chr1 1666232 1666235      +  38   ctrl
## 6 chr1 2299054 2299057      +  52   ctrl
df.all=rbind(df.chip, df.ctrl)
df.all$status = factor(df.all$status, levels = c("quantile1", "quantile0.8", "quantile0.6",  "quantile0.4", "quantile0.2", "ctrl"))
df.all$dis=as.integer(df.all$dis)
df.all$strand=as.factor(df.all$strand)
head(df.all)
##     V1      V2      v3 strand dis      status
## 1 chr1  924776  924779      -  73 quantile0.2
## 2 chr1  966791  966794      -  71 quantile0.2
## 3 chr1  999577  999580      +   6 quantile0.2
## 4 chr1 1000628 1000631      -   0 quantile0.2
## 5 chr1 1002046 1002049      - 107 quantile0.2
## 6 chr1 1020331 1020334      -   0 quantile0.2
nrow(df.all)
## [1] 46501

Plot PDF:

library(lattice)
library(latticeExtra)

densityplot( ~ abs(dis) | status,
            groups = strand,
            auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
            data = df.all,
            col=c("blue", "red"),
            aspect = 2.5,
            xlim=c(0,50),
            from=0,
            to=50,
            layout=c(6,1),
            xlab = "distance (bp) from 2nd GAT to 1st GAT",
            main = "ctrl vs. GATA3 peaks",
            #between=list(y=1.0),
            type = "count",
            lty = c(1),
            lwd=2,
            par.settings = list(superpose.line = list(col=c("blue", "red"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 9, lty =2)
             panel.abline(v= 2, lty =2)
             panel.densityplot(...)
         })

In this plot, we use probability density function to plot the relative likelihood at different distances of observing a second closest 3mer GAT to the 1st closest GAT.

It is showing a dosage effect when we decreased peak intensity (from quantile 1 to quantile 0.2). The best set is quantile 1.

make a single plot

library(lattice)
library(latticeExtra)

densityplot( ~ abs(dis),
            groups = status,
            data = df.all,
            auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
            col=c("#ce228e", colorRampPalette(c("red","pink"))(4), "grey60"),
            aspect = 1,
            xlim=c(0,50),
            from=0,
            to=50,
            #layout=c(1,2),
            xlab = "distance (bp) from 2nd GAT to 1st GAT",
            main = "ctrl vs. GATA3 peaks",
            between=list(y=1.0),
            type = "count",
            lty = c(1),
            lwd=2,
            par.settings = list(superpose.line = list(col=c("#ce228e", colorRampPalette(c("red","pink"))(4),"grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 9, lty =2)
             panel.abline(v= 2, lty =2)
             panel.densityplot(...)
         })

In the above PDF plot, the spike is at 2bp distance and 9bp distance. It is important to know that this is repoting the likelihood, not the actual counts at this two location.

If we count and compare how many entries has 0, 1, 2, 9, 10, 11bp distance, we would see that the counts is apparently higher in 0 and 10bp than in 2 and 9bp.

head closest.2nd.distance.quantile1.bed
wc -l closest.2nd.distance.quantile1.bed
awk '$11 == 0' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 1' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 2' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 9' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 10' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 11' closest.2nd.distance.quantile1.bed | wc -l
## chr1 5598164 5598167 chr1    5598146 5598149 36  36  +   GAT 16
## chr1 8017663 8017666 chr1    8017634 8017637 14  14  -   GAT 27
## chr1 8020170 8020173 chr1    8020158 8020161 14  14  -   GAT 10
## chr1 8055202 8055205 chr1    8055193 8055196 14  14  -   GAT 7
## chr1 8061441 8061444 chr1    8061437 8061440 36  36  +   GAT 2
## chr1 8120785 8120788 chr1    8120752 8120755 36  36  +   GAT 31
## chr1 8182615 8182618 chr1    8182616 8182619 14  14  -   GAT 0
## chr1 8197832 8197835 chr1    8197815 8197818 14  14  -   GAT 15
## chr1 8204648 8204651 chr1    8204655 8204658 36  36  +   GAT 5
## chr1 8258971 8258974 chr1    8258981 8258984 14  14  -   GAT 8
##     7791 closest.2nd.distance.quantile1.bed
##     1095
##      187
##      137
##      317
##      524
##      446

Probability density plot is reporting the likelihood (y axis) of observing an event (in our case, have a 2nd closest GAT at n position) in a defined range (this is defined by from and to, in the above plot it is 50).
It is important to set from and to arguments inside density(). Because by default, density will extend the range so that the density curve approaches 0 at the extreme.

If I change the defined range from (0,50) to (0,1000), the plot would look different:

densityplot( ~ abs(dis),
            groups = status,
            data = df.all,
            auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
            col=c("#ce228e", colorRampPalette(c("red","pink"))(4), "grey60"),
            aspect = 1,
            xlim=c(0,50),
            from=0,
            to=1000,
            #layout=c(1,2),
            xlab = "distance (bp) from 2nd GAT to 1st GAT",
            main = "ctrl vs. GATA3 peaks",
            between=list(y=1.0),
            type = "count",
            lty = c(1),
            lwd=2,
            par.settings = list(superpose.line = list(col=c("#ce228e", colorRampPalette(c("red","pink"))(4),"grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 10, lty =2)
             panel.abline(v= 2, lty =2)
             panel.densityplot(...)
         })

Now the highest spike in quantile1 became 10bp distance.

histogram

The density plots are good to visualize the distribution of continuous numeric variables. The densityplot() uses kernel density probability estimate to calculate the density probability of numeric variables.
Since we have discrete distance (in single bp resolution), histogram might be more suitable.
Histograms are constructed by binning the data and counting the number of observations in each bin. We can set the binwidth to be 1bp.

str(df.all)
## 'data.frame':    46501 obs. of  6 variables:
##  $ V1    : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2    : int  924776 966791 999577 1000628 1002046 1020331 1069481 1079763 1162642 1163472 ...
##  $ v3    : int  924779 966794 999580 1000631 1002049 1020334 1069484 1079766 1162645 1163475 ...
##  $ strand: Factor w/ 2 levels "-","+": 1 1 2 1 1 1 1 2 1 2 ...
##  $ dis   : int  73 71 6 0 107 0 27 193 154 104 ...
##  $ status: Factor w/ 6 levels "quantile1","quantile0.8",..: 5 5 5 5 5 5 5 5 5 5 ...
summary(df.all$dis)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     7.00    20.00    38.63    48.00 24806.00
histogram(~ abs(dis) | status,
            groups = strand,
            key = list(space = "right", text=list(c("- strand", "+ strand"), col=c("blue","red"), cex=1)),
            data = df.all,
            col=c("blue", "red"),
            aspect = 3,
            xlim=c(0,20),
            breaks=seq(from=0,to=24806, by=1),
            layout=c(6,1),
            xlab = "distance (bp) from 2nd GAT to 1st GAT",
            main = "ctrl vs. GATA3 peaks",
            between=list(y=1.0),
            type="density",
            lwd=2,
            par.settings = list(superpose.line = list(col=c("blue", "red"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 10, lty =2)
             panel.abline(v= 1, lty =2)
             panel.densityplot(...)
             panel.histogram(...)
         })

str(df.all)
## 'data.frame':    46501 obs. of  6 variables:
##  $ V1    : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2    : int  924776 966791 999577 1000628 1002046 1020331 1069481 1079763 1162642 1163472 ...
##  $ v3    : int  924779 966794 999580 1000631 1002049 1020334 1069484 1079766 1162645 1163475 ...
##  $ strand: Factor w/ 2 levels "-","+": 1 1 2 1 1 1 1 2 1 2 ...
##  $ dis   : int  73 71 6 0 107 0 27 193 154 104 ...
##  $ status: Factor w/ 6 levels "quantile1","quantile0.8",..: 5 5 5 5 5 5 5 5 5 5 ...
summary(df.all$dis)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     7.00    20.00    38.63    48.00 24806.00
histogram(~ abs(dis),
            groups = status,
            data=df.all,
            auto.key = list(space = "right", rectangles=FALSE, lines=TRUE, points=FALSE, cex = 1),
            #col=c("#ce228e", colorRampPalette(c("red","pink"))(4), "grey60"),
            aspect = 1,
            xlim=c(0,30),
            breaks=seq(from=0,to=max(df.all$dis), by=1),
            xlab = "distance (bp) from 2nd GAT to 1st GAT",
            main = "ctrl vs. GATA3 peaks",
            between=list(y=1.0),
            type="density",
            lwd=2,
            par.settings = list(superpose.line = list(col=c("#ce228e", colorRampPalette(c("red","pink"))(4),"grey60"), lwd=3), strip.background=list(col="grey85")),
            panel = function(...) {
             panel.abline(v= 10, lty =2, col="red")
             panel.abline(v= 1, lty =2, col="red")
             panel.histogram(...)
             panel.densityplot(...) #darg=list(bw = 0.5, kernel="gaussian"),
             panel.superpose(..., panel.groups=panel.histogram,
                          col=c("#ce228e", colorRampPalette(c("red","pink"))(4),"grey60"),alpha=0.4)
         })

Conclusion: I think, quantile1 (top 20% intensity) is where we could make the cutoff in the entire GATA3 ChIP peaks. These peaks are: 1) higher intensity; 2) more than 70% peaks has GATA-like motifs found by MEME/STREME; 3) for peaks lack MEME/STREME found motif, we have good enrichment of GAT near peak summit, and a good 3mer-3mer structure.

3.1 Genome browser

3.1.1 visualize 3mer-GAT distribution relative to the defined “closest GAT” on Genome Browser

files to upload:
-3mer-GAT coordinates file (+/-) (in .bigWig format);
-several annotation bed files including: 1) peak summit files; 2) closest GAT to peak summit; 3) 2nd closest GAT to peak summit.

library(bigWig)
peak.region.summit=center.bed(read.table('/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/de_novo_motif/without_motifs_123456_789.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
write.table(peak.region.summit,file= 'without_motifs_123456_789_summits.bed', quote=F,sep="\t",col.names=F,row.names=F)
# 1)
cat without_motifs_123456_789_summits.bed | sort -k1,1 -k2,2n > without_motifs_123456_789_summits.sorted.bed

# 2)
cat closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.bed closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.bed closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.bed closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.bed closest_1st_GAT_to_GATA_uniq_peak_quantile1.bed | sort -k1,1 -k2,2n > closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted.bed

awk '{print $1, $2, $3, $4, $5, $6}' closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted.bed | awk '{$1=$1}1' OFS="\t" > closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted2.bed

# 3)
cat closest.2nd.distance.quantile1.bed closest.2nd.distance.quantile0.8.bed closest.2nd.distance.quantile0.2.bed closest.2nd.distance.quantile0.4.bed closest.2nd.distance.quantile0.6.bed | sort -k1,1 -k2,2n > closest.2nd.distance.quantile.all.sorted.bed

awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.distance.quantile.all.sorted.bed | awk '{$1=$1}1' OFS="\t" > closest.2nd.distance.quantile.all.sorted2.bed

Add trackline:

awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"peak.withoutmotifs.summit.bed\" description=\"peak.withoutmotifs.summit_bed\" visibility=full autoScale=on useScore=1 color=0,0,0"
            } {print $0}' without_motifs_123456_789_summits.sorted.bed > without_motifs_123456_789_summits.sorted.header.bed
           
awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"closest.1st.GAT_bed\" description=\"closest.1st.GAT_bed\" visibility=full autoScale=on useScore=1 color=255,0,0"
            } {print $0}' closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted2.bed > closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted.header.bed
           
awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"closest.2nd.GAT_bed\" description=\"closest.2nd.GAT_bed\" visibility=full autoScale=on useScore=1 color=0,0,255"
            } {print $0}' closest.2nd.distance.quantile.all.sorted2.bed > closest.2nd.distance.quantile.all.sorted.header.bed

convert to bigWig
sort the .bed file:

#GAT
dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs30/seqdump/
plus_file=${dir}hg38.3.3.3plus.36_GAT.bed
minus_file=${dir}hg38.3.3.3minus.14_GAT.bed  
echo "plus_GAT (.bed)"
wc -l ${plus_file}
head -5  ${plus_file}
echo "minus_GAT (.bed)"
wc -l ${minus_file}
head -5 ${minus_file}


sort -k1,1 -k2,2n ${plus_file} > hg38.3.3.3plus.36_GAT_sorted.bed
sort -k1,1 -k2,2n ${minus_file} > hg38.3.3.3minus.14_GAT_sorted.bed
wc -l hg38.3.3.3plus.36_GAT_sorted.bed
wc -l hg38.3.3.3minus.14_GAT_sorted.bed

Use bedtools genomecov -bg to convert .bed to .bedGraph.

module load bedtools
module load ucsc_genome/2012.05.22

sizes=/home/FCAM/ssun/Genome_pro/hg38.chrom.sizes
#bedtools genomecov tool
bedtools genomecov -bg -i hg38.3.3.3plus.36_GAT_sorted.bed -g ${sizes} > hg38.3.3.3plus.36_GAT.bedGraph 
bedtools genomecov -bg -i hg38.3.3.3minus.14_GAT_sorted.bed -g ${sizes} > hg38.3.3.3minus.14_GAT.bedGraph 
wc -l hg38.3.3.3plus.36_GAT.bedGraph 
wc -l hg38.3.3.3minus.14_GAT.bedGraph 

The final step is to convert .bedGraph to .bigWig with UCSC bedGraphToBigWig tool.

#UCSC bedGraphToBigWig tool
bedGraphToBigWig hg38.3.3.3plus.36_GAT.bedGraph ${sizes} hg38.3.3.3plus.36_GAT.bigWig
bedGraphToBigWig hg38.3.3.3minus.14_GAT.bedGraph ${sizes} hg38.3.3.3minus.14_GAT.bigWig
cat UCSCGB.sh_7536738.out
## plus_GAT (.bed)
## 32147038 /labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs30/seqdump/hg38.3.3.3plus.36_GAT.bed
## chr1 11521   11524   36  36  +   GAT
## chr1 12188   12191   36  36  +   GAT
## chr1 13274   13277   36  36  +   GAT
## chr1 13314   13317   36  36  +   GAT
## chr1 15171   15174   36  36  +   GAT
## minus_GAT (.bed)
## 32096009 /labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs30/seqdump/hg38.3.3.3minus.14_GAT.bed
## chr1 11145   11148   14  14  -   GAT
## chr1 11160   11163   14  14  -   GAT
## chr1 11576   11579   14  14  -   GAT
## chr1 13315   13318   14  14  -   GAT
## chr1 19797   19800   14  14  -   GAT
## 32147038 hg38.3.3.3plus.36_GAT_sorted.bed
## 32096009 hg38.3.3.3minus.14_GAT_sorted.bed
## 31502719 hg38.3.3.3plus.36_GAT.bedGraph
## 31455328 hg38.3.3.3minus.14_GAT.bedGraph

**Note that the .bedGraph converted from .bed with bedtools genomecov has less genome coordinates entry. This is because the original .bed file contains adjacent regions, and bedtools genomecov collapse these regions into a single entry in the BEDGraph file by merging or summarizing the coverage data.

Trackhub link: http://guertinlab.cam.uchc.edu/over3mer_hub/hub.txt
The annotation BED files need to be upload via Custom Track.

Sved Section:

UCSC Genome browser image

Figure 2: UCSC Genome browser image

I want to locate regions where the distance between closest 1st and closest 2nd GAT are 0bp, 1bp, 2bp, 9bp and 10bp.

UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 0bp.

Figure 3: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 0bp

UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 1bp.

Figure 4: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 1bp

UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 2bp.

Figure 5: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 2bp

UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 9bp.

Figure 6: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 9bp

UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 10bp.

Figure 7: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 10bp

Notice that if closestBed found overlap features, the distance will be reported as 0; if no overlaps are found, closestBed will look for the feature in B that is closest (that is, least genomic distance to the start or end of A) to A.
If B is upstream of A, then it will use the start of A to subtract the end of B, then add 1;
If B is downstream of A, then it will use the start of B to subtract the end of A, then add 1.

Since closestBed is adding 1bp in its calculation of distance, the actual fixed spacings between two 3mer will be distance -1.

3.2 Other negative control

3.2.1 CTCF peak sets

Besides the union DHS regions from ENCODE, we also performed CTCF ChIP-seq with MCF7 cells, which could serve as an alternative negative control for this analysis.

Review of CTCF ChIP-seq:

Using MACS3 and two biological replicates of CTCF_CC libraries, we identified 94738 CTCF ChIP peaks.
To make this a proper control for our GATA3 peak sets (that without MEME/STREME motifs), we use MAST to remove CTCF peaks regions that contains GATA3-like motif 123456789 (use same p-value stringency).

We have 9 motifs that found by MEME/STREME software.

ls GATA3_MEME_STREME_motifs/
#cat GATA3_MEME_STREME_motifs/meme_1.txt
## AGATAAM_streme.txt
## AGATDNHATCT_streme.txt
## AGATNDWNAGATARN_meme.txt_meme_4.txt
## BTTATCWGATB_meme_5.txt_meme.txt
## TGATAA_streme.txt
## WGATBDHRVAGATAA_meme.txt_meme_6.txt
## meme_1.txt
## meme_2.txt
## meme_3.txt

First convert CTCF ChIP-peak file (100bp window) from .bed to .fasta.

#cd /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/overrep_3mer/CTCF_peak_without_123456789
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa

dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/CTCF_peak/
fastaFromBed -fi $genome -bed ${dir}CTCF_ChIP_summit_100window.bed -fo CTCF_ChIP_summit_100window.fasta

wc -l ${dir}CTCF_ChIP_summit_100window.bed #94738

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa

#round1
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_1.txt CTCF_ChIP_summit_100window.fasta > mast_GATA3_PSWM_in_CTCF_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round1.txt
wc -l mast_GATA3_PSWM_in_CTCF_round1.bed #584
intersectBed -v -a ${dir}CTCF_ChIP_summit_100window.bed -b mast_GATA3_PSWM_in_CTCF_round1.bed > CTCF_without_motifs_1.bed
wc -l CTCF_without_motifs_1.bed #94154

#round2
fastaFromBed -fi $genome -bed CTCF_without_motifs_1.bed -fo CTCF_without_motifs_1.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_2.txt CTCF_without_motifs_1.fasta > mast_GATA3_PSWM_in_CTCF_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round2.txt
wc -l mast_GATA3_PSWM_in_CTCF_round2.bed #675
intersectBed -v -a CTCF_without_motifs_1.bed -b mast_GATA3_PSWM_in_CTCF_round2.bed > CTCF_without_motifs_12.bed
wc -l CTCF_without_motifs_12.bed #93479

#round3
fastaFromBed -fi $genome -bed CTCF_without_motifs_12.bed -fo CTCF_without_motifs_12.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_3.txt CTCF_without_motifs_12.fasta > mast_GATA3_PSWM_in_CTCF_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round3.txt
wc -l mast_GATA3_PSWM_in_CTCF_round3.bed #541
intersectBed -v -a CTCF_without_motifs_12.bed -b mast_GATA3_PSWM_in_CTCF_round3.bed > CTCF_without_motifs_123.bed
wc -l CTCF_without_motifs_123.bed # 92938

#round4
fastaFromBed -fi $genome -bed CTCF_without_motifs_123.bed -fo CTCF_without_motifs_123.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/AGATNDWNAGATARN_meme.txt_meme_4.txt CTCF_without_motifs_123.fasta > mast_GATA3_PSWM_in_CTCF_round4.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round4.txt
wc -l mast_GATA3_PSWM_in_CTCF_round4.bed #614
intersectBed -v -a CTCF_without_motifs_123.bed -b mast_GATA3_PSWM_in_CTCF_round4.bed > CTCF_without_motifs_1234.bed
wc -l CTCF_without_motifs_1234.bed #92324

#round5
fastaFromBed -fi $genome -bed CTCF_without_motifs_1234.bed -fo CTCF_without_motifs_1234.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/BTTATCWGATB_meme_5.txt_meme.txt CTCF_without_motifs_1234.fasta > mast_GATA3_PSWM_in_CTCF_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round5.txt
wc -l mast_GATA3_PSWM_in_CTCF_round5.bed #490
intersectBed -v -a CTCF_without_motifs_1234.bed -b mast_GATA3_PSWM_in_CTCF_round5.bed > CTCF_without_motifs_12345.bed
wc -l CTCF_without_motifs_12345.bed #91834


#round6
fastaFromBed -fi $genome -bed CTCF_without_motifs_12345.bed -fo CTCF_without_motifs_12345.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/WGATBDHRVAGATAA_meme.txt_meme_6.txt CTCF_without_motifs_12345.fasta > mast_GATA3_PSWM_in_CTCF_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round6.txt
wc -l mast_GATA3_PSWM_in_CTCF_round6.bed #653
intersectBed -v -a CTCF_without_motifs_12345.bed -b mast_GATA3_PSWM_in_CTCF_round6.bed > CTCF_without_motifs_123456.bed 
wc -l CTCF_without_motifs_123456.bed #91181

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed CTCF_without_motifs_123456.bed -fo CTCF_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/AGATAAM_streme.txt CTCF_without_motifs_123456.fasta > mast_GATA3_PSWM_in_CTCF_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round7.txt
wc -l mast_GATA3_PSWM_in_CTCF_round7.bed #2692
intersectBed -v -a CTCF_without_motifs_123456.bed -b mast_GATA3_PSWM_in_CTCF_round7.bed > CTCF_without_motifs_123456_7.bed
wc -l CTCF_without_motifs_123456_7.bed #88489 


#round8
fastaFromBed -fi $genome -bed CTCF_without_motifs_123456_7.bed -fo CTCF_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/TGATAA_streme.txt CTCF_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_CTCF_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round8.txt
wc -l mast_GATA3_PSWM_in_CTCF_round8.bed #1914
intersectBed -v -a CTCF_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_CTCF_round8.bed > CTCF_without_motifs_123456_78.bed 
wc -l CTCF_without_motifs_123456_78.bed  #86575

#round9
fastaFromBed -fi $genome -bed CTCF_without_motifs_123456_78.bed -fo CTCF_without_motifs_123456_78.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/AGATDNHATCT_streme.txt CTCF_without_motifs_123456_78.fasta > mast_GATA3_PSWM_in_CTCF_round9.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round9.txt
wc -l mast_GATA3_PSWM_in_CTCF_round9.bed #1840
intersectBed -v -a CTCF_without_motifs_123456_78.bed -b mast_GATA3_PSWM_in_CTCF_round9.bed > CTCF_without_motifs_123456_789.bed
wc -l CTCF_without_motifs_123456_789.bed #84735

Randomly subset the CTCF peaks
After the removal of the 9 GATA3 motifs with mast, the CTCF peak set now remains 84735 regions (originally 94738, removed ~10% regions).

Here I am using shuf to randomly select 7800 CTCF peaks to match with the GATA peak subset (quantile1: 7796 peaks) I am going to compare with.

shuf -n 7800 CTCF_without_motifs_123456_789.bed > random_7800_CTCF_without_motifs_123456_789.bed
wc -l random_7800_CTCF_without_motifs_123456_789.bed 
head random_7800_CTCF_without_motifs_123456_789.bed
##     7800 random_7800_CTCF_without_motifs_123456_789.bed
## chr18    48951784    48951885    CTCF_ChIP_peak_36395    18.0917
## chr19    3146694 3146795 CTCF_ChIP_peak_37117    150.667
## chr13    51445062    51445163    CTCF_ChIP_peak_20942    238.503
## chr1 206937764   206937865   CTCF_ChIP_peak_6646 4.83363
## chr16    76854379    76854480    CTCF_ChIP_peak_30360    292.784
## chrX 153887154   153887255   CTCF_ChIP_peak_84880a   4.15912
## chr17    63882339    63882440    CTCF_ChIP_peak_33796b   396.571
## chr14    20460965    20461066    CTCF_ChIP_peak_21906    5.58175
## chr9 19085464    19085565    CTCF_ChIP_peak_78691    134.583
## chr13    50126140    50126241    CTCF_ChIP_peak_20902a   12.1923

Read the file in and use the center as regulatory region summit:

library(bigWig)
chip.peak=read.table("random_7800_CTCF_without_motifs_123456_789.bed", header=FALSE)
nrow(chip.peak)
## [1] 7800
head(chip.peak)
##      V1        V2        V3                    V4        V5
## 1 chr18  48951784  48951885  CTCF_ChIP_peak_36395  18.09170
## 2 chr19   3146694   3146795  CTCF_ChIP_peak_37117 150.66700
## 3 chr13  51445062  51445163  CTCF_ChIP_peak_20942 238.50300
## 4  chr1 206937764 206937865   CTCF_ChIP_peak_6646   4.83363
## 5 chr16  76854379  76854480  CTCF_ChIP_peak_30360 292.78400
## 6  chrX 153887154 153887255 CTCF_ChIP_peak_84880a   4.15912
chip.peak.summit=center.bed(chip.peak, upstreamWindow=0, downstreamWindow=0)
nrow(chip.peak.summit)
## [1] 7800
head(chip.peak.summit)
##      V1        V2        V3                    V4        V5
## 1 chr18  48951834  48951835  CTCF_ChIP_peak_36395  18.09170
## 2 chr19   3146744   3146745  CTCF_ChIP_peak_37117 150.66700
## 3 chr13  51445112  51445113  CTCF_ChIP_peak_20942 238.50300
## 4  chr1 206937814 206937815   CTCF_ChIP_peak_6646   4.83363
## 5 chr16  76854429  76854430  CTCF_ChIP_peak_30360 292.78400
## 6  chrX 153887204 153887205 CTCF_ChIP_peak_84880a   4.15912

find the closest GAT to summit of unionDHS with closestBed.

all.GAT.file=read.table(file = "../peak_without_123456789/hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
CTCF.ctrl.distance=bedTools.closest(bed1 = chip.peak.summit[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(CTCF.ctrl.distance,file= 'CTCF.ctrl.distance.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231214_GATA3_ChIP_clsestbed.Rdata') 
wc -l CTCF.ctrl.distance.bed
head -5 CTCF.ctrl.distance.bed

awk '{print $4, $5, $6, $7, $8, $9, $10}' CTCF.ctrl.distance.bed | awk '{$1=$1}1' OFS="\t"> closest_1st_GAT_to_CTCF_peak.bed
wc -l closest_1st_GAT_to_CTCF_peak.bed

awk '{print $4, $5, $6, $7, $8, $9, $10}' CTCF.ctrl.distance.bed | awk '{$1=$1}1' OFS="\t" | sort | uniq> closest_1st_GAT_to_CTCF_uniq_peak.bed
wc -l closest_1st_GAT_to_CTCF_uniq_peak.bed 
##     7800 CTCF.ctrl.distance.bed
## chr1 826944  826945  chr1    826956  826959  36  36  +   GAT 12
## chr1 913010  913011  chr1    912992  912995  14  14  -   GAT 16
## chr1 1116689 1116690 chr1    1116512 1116515 14  14  -   GAT 175
## chr1 1247245 1247246 chr1    1247393 1247396 36  36  +   GAT 148
## chr1 1305399 1305400 chr1    1305387 1305390 14  14  -   GAT 10
##     7800 closest_1st_GAT_to_CTCF_peak.bed
##     7788 closest_1st_GAT_to_CTCF_uniq_peak.bed
head -5 closest_1st_GAT_to_CTCF_uniq_peak.bed
awk '$6 == "+" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_CTCF_uniq_peak.bed
awk '$6 == "-" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_CTCF_uniq_peak.bed
## chr1 10033721    10033724    36  36  +   GAT
## chr1 100463958   100463961   14  14  -   GAT
## chr1 101357518   101357521   14  14  -   GAT
## chr1 101769729   101769732   14  14  -   GAT
## chr1 101971482   101971485   14  14  -   GAT
## Positive strand count: 4004
## Positive strand count: 3784

bedtools subtract
Subtract the 1st closest GAT from all.GAT, then find the closest 2nd GAT to the closest 1st GAT.
-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.

module load bedtools 
sort -k1,1 -k2,2n ../peak_without_123456789/hg38.3.3.3.30_plus_minus_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.sorted.bed
head -5 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11521   11524   36  36  +   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   12188   12191   36  36  +   GAT

sort -k1,1 -k2,2n closest_1st_GAT_to_CTCF_uniq_peak.bed > closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed 
head -5 closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed
#chr1   826956  826959  36  36  +   GAT
#chr1   912992  912995  14  14  -   GAT
#chr1   1116512 1116515 14  14  -   GAT
#chr1   1247393 1247396 36  36  +   GAT
#chr1   1305387 1305390 14  14  -   GAT

bedtools subtract -a hg38.3.3.3.30_plus_minus_GAT.sorted.bed -b closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed -f 1.00 -s > hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_CTCF.bed

wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed #7788
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_CTCF.bed #64235259

#64243047-7788
#[1] 64235259

Read the file (1st closest GAT to CTCF summit) in:

closest_1st_GAT_to_CTCF=read.table("closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed", header=FALSE)
nrow(closest_1st_GAT_to_CTCF)
## [1] 7788
head(closest_1st_GAT_to_CTCF)
##     V1      V2      V3 V4 V5 V6  V7
## 1 chr1  826956  826959 36 36  + GAT
## 2 chr1  912992  912995 14 14  - GAT
## 3 chr1 1116512 1116515 14 14  - GAT
## 4 chr1 1247393 1247396 36 36  + GAT
## 5 chr1 1305387 1305390 14 14  - GAT
## 6 chr1 1308225 1308228 14 14  - GAT

Read the GAT coorrdinates file without the 1st closest GAT.

all.GAT.without.1st.closest.ctrl.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_CTCF.bed", sep="\t", header=FALSE)
nrow(all.GAT.without.1st.closest.ctrl.file) 
#64235259
head(all.GAT.without.1st.closest.ctrl.file)
#    V1    V2    V3 V4 V5 V6  V7
#1 chr1 11145 11148 14 14  - GAT
#2 chr1 11160 11163 14 14  - GAT
#3 chr1 11521 11524 36 36  + GAT
#4 chr1 11576 11579 14 14  - GAT
#5 chr1 12188 12191 36 36  + GAT
#6 chr1 13274 13277 36 36  + GAT

Use bedtools closestBed to find the 2nd closest GAT to the 1st closest GAT use the output from bedtools subtract.

closest.2nd.distance.CTCF=bedTools.closest(bed1 = closest_1st_GAT_to_CTCF[,1:3], bed2 = all.GAT.without.1st.closest.ctrl.file, opt.string = '-d -t first')
write.table(closest.2nd.distance.CTCF, file= 'closest.2nd.distance.CTCF.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231214_GATA3_ChIP_2nd_clsestbed.Rdata') 

We decided to not use the CTCF ChIP peak as the negative control because it will bias the comparison. Since CTCF peaks will be enriched of CTCF motifs and these binding element will have fewer GAT, ATC…

4 Goals and plans

Goals:

-Look for software capable of searching for motifs with variable spacings.
-Determine the criteria for identifying a GATA3 binding element (e.g., GATA3 can bind to 3mer-3mer instances with a non-fixed spacing range from x to y).
-refine some details: 1) Define “spacing” as the relative position of one zinc finger to another. 2) Consider strand orientation: distinguish between cases where 3mer-3mer instances on the same strand, with cases that one 3mer on the positive (+) and the other on the negative (-) strand.
-Consider the creation of communication-oriented figures (e.g., PWMs of known enriched GATA3 motifs, histograms depicting 3mer-3mer distributions representing motif structure).

Plans:

  1. Prioritize finding software capable of searching for motifs with variable spacings.

  2. Determine negative control:
    Extend 200bp to peak summit for local control; generate a CDF of the closest GAT distances and a PDF of the distance between the 2nd GAT to the 1st GAT. Compare this local control against the unionDHS control.
    Meanwhile, search for union DHS regions in MCF7 cells from ENCODE.

  3. Display 3mer-3mer distributions for positive controls using histograms (use peak sets with enriched motifs that have known 3mer-3mer structures).

  4. Analyze all 64 3mer bigWigs to find the closest 3mer to peak summit/control summit. Draw CDFs and prioritize a subset of 3mers based on proximity to the GATA3 peak summit compared to the control.


  5. For each prioritized 3mer, identify the second closest 3mer by looping through all 64 3mers. Present histogram/PDF of the distances.
    Consolidate the PDFs into a single plot, with each trace representing a 3mer-3mer pair showcasing the relative distance between their zinc finger binding sites compare to negative control.
    Use this data we could make criteria for the GATA3 binding element: 1) threshold of spacing; and 2) potential 3mer-3mer combinations.

  6. We could also perform the same analysis on spaced 3mer (i.e., GXTA) (seqOutbias).

5 Week of Dec 18th

5.1 local control

Goal: add 200bp to the original peak summit, use this new locus as “control summit” to look for 1st and 2nd closest GAT.

peak summit files in quantiles These are the peak summit files that are ranked from high intensity to low intensity in 5 quantiles.

for i in *quantile*summits.bed
do
  wc -l $i
  head -5 $i
done
##     7594 quantile0.2_summits.bed
## chr1 924853  924854  30.4468389666785
## chr1 966653  966654  26.8701605998383
## chr1 999508  999509  21.2245532255625
## chr1 1000536 1000537 24.342633538906
## chr1 1001891 1001892 25.5287170527213
##     7796 quantile0.4_summits.bed
## chr1 916769  916770  36.6999257213148
## chr1 1013265 1013266 34.7022145570112
## chr1 1013580 1013581 36.2692161076724
## chr1 1021142 1021143 32.6995098507957
## chr1 1158192 1158193 36.2174294299052
##     7796 quantile0.6_summits.bed
## chr1 827380  827381  45.324686100056
## chr1 1074184 1074185 48.1888265664752
## chr1 1124966 1124967 45.4661019134305
## chr1 1157747 1157748 44.1928777685044
## chr1 3587455 3587456 51.9193632691721
##     7796 quantile0.8_summits.bed
## chr1 1073822 1073823 56.7025480761926
## chr1 5637335 5637336 63.6099238799638
## chr1 7019683 7019684 72.4024868144889
## chr1 7266189 7266190 70.6631991025767
## chr1 7494212 7494213 75.1725145398337
##     7796 quantile1_summits.bed
## chr1 5598187 5598188 271.931218584276
## chr1 8017660 8017661 676.313615413536
## chr1 8020178 8020179 324.955493506146
## chr1 8055212 8055213 111.558738286523
## chr1 8061475 8061476 98.519268615125

These are all 1bp-centered region of GATA3 peaks without the MEMNE/STREME found motifs.

We will add 200bp to the downstream of these peak summit as the “local control”.

library(bigWig)
#quantile1
quantile1_summits.200bpctrl=center.bed(read.table(file="quantile1_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile1_summits.bed",sep="\t", header=FALSE))
##     V1      V2      V3        V4
## 1 chr1 5598187 5598188 271.93122
## 2 chr1 8017660 8017661 676.31362
## 3 chr1 8020178 8020179 324.95549
## 4 chr1 8055212 8055213 111.55874
## 5 chr1 8061475 8061476  98.51927
## 6 chr1 8120791 8120792 124.49564
head(quantile1_summits.200bpctrl)
##     V1      V2      V3        V4
## 1 chr1 5598387 5598388 271.93122
## 2 chr1 8017860 8017861 676.31362
## 3 chr1 8020378 8020379 324.95549
## 4 chr1 8055412 8055413 111.55874
## 5 chr1 8061675 8061676  98.51927
## 6 chr1 8120991 8120992 124.49564
#write.table(quantile1_summits.200bpctrl, file= 'quantile1_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
#quantile0.8
quantile0.8_summits.200bpctrl=center.bed(read.table(file="quantile0.8_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile0.8_summits.bed",sep="\t", header=FALSE))
##     V1      V2      V3       V4
## 1 chr1 1073822 1073823 56.70255
## 2 chr1 5637335 5637336 63.60992
## 3 chr1 7019683 7019684 72.40249
## 4 chr1 7266189 7266190 70.66320
## 5 chr1 7494212 7494213 75.17251
## 6 chr1 8026313 8026314 69.95015
head(quantile0.8_summits.200bpctrl)
##     V1      V2      V3       V4
## 1 chr1 1074022 1074023 56.70255
## 2 chr1 5637535 5637536 63.60992
## 3 chr1 7019883 7019884 72.40249
## 4 chr1 7266389 7266390 70.66320
## 5 chr1 7494412 7494413 75.17251
## 6 chr1 8026513 8026514 69.95015
#write.table(quantile0.8_summits.200bpctrl, file= 'quantile0.8_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
#quantile0.6
quantile0.6_summits.200bpctrl=center.bed(read.table(file="quantile0.6_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile0.6_summits.bed",sep="\t", header=FALSE))
##     V1      V2      V3       V4
## 1 chr1  827380  827381 45.32469
## 2 chr1 1074184 1074185 48.18883
## 3 chr1 1124966 1124967 45.46610
## 4 chr1 1157747 1157748 44.19288
## 5 chr1 3587455 3587456 51.91936
## 6 chr1 6199528 6199529 42.31217
head(quantile0.6_summits.200bpctrl)
##     V1      V2      V3       V4
## 1 chr1  827580  827581 45.32469
## 2 chr1 1074384 1074385 48.18883
## 3 chr1 1125166 1125167 45.46610
## 4 chr1 1157947 1157948 44.19288
## 5 chr1 3587655 3587656 51.91936
## 6 chr1 6199728 6199729 42.31217
#write.table(quantile0.6_summits.200bpctrl, file= 'quantile0.6_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
#quantile0.4
quantile0.4_summits.200bpctrl=center.bed(read.table(file="quantile0.4_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile0.4_summits.bed",sep="\t", header=FALSE))
##     V1      V2      V3       V4
## 1 chr1  916769  916770 36.69993
## 2 chr1 1013265 1013266 34.70221
## 3 chr1 1013580 1013581 36.26922
## 4 chr1 1021142 1021143 32.69951
## 5 chr1 1158192 1158193 36.21743
## 6 chr1 1231880 1231881 35.47152
head(quantile0.4_summits.200bpctrl)
##     V1      V2      V3       V4
## 1 chr1  916969  916970 36.69993
## 2 chr1 1013465 1013466 34.70221
## 3 chr1 1013780 1013781 36.26922
## 4 chr1 1021342 1021343 32.69951
## 5 chr1 1158392 1158393 36.21743
## 6 chr1 1232080 1232081 35.47152
#write.table(quantile0.4_summits.200bpctrl, file= 'quantile0.4_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
#quantile0.2
quantile0.2_summits.200bpctrl=center.bed(read.table(file="quantile0.2_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile0.2_summits.bed",sep="\t", header=FALSE))
##     V1      V2      V3       V4
## 1 chr1  924853  924854 30.44684
## 2 chr1  966653  966654 26.87016
## 3 chr1  999508  999509 21.22455
## 4 chr1 1000536 1000537 24.34263
## 5 chr1 1001891 1001892 25.52872
## 6 chr1 1020187 1020188 20.63151
head(quantile0.2_summits.200bpctrl)
##     V1      V2      V3       V4
## 1 chr1  925053  925054 30.44684
## 2 chr1  966853  966854 26.87016
## 3 chr1  999708  999709 21.22455
## 4 chr1 1000736 1000737 24.34263
## 5 chr1 1002091 1002092 25.52872
## 6 chr1 1020387 1020388 20.63151
#write.table(quantile0.2_summits.200bpctrl, file= 'quantile0.2_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)

5.2 Find closest 1st GAT to local control peak summit

In this section, we use bedtools closestBed (refer to: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to find the closest GAT to each provided peak summit.
Input:
-a is the sorted peak summit file (centered 1bp);
-b is the sorted, and concatenated GAT coordinates file (both plus and minus);

Input1 -a: peak: quantile1_summits.bed : peaks without motif 123456789, top20% intensity

for i in *quantile*summits.200bpctrl.bed
do
  wc -l $i
  head -5 $i
done
##     7594 quantile0.2_summits.200bpctrl.bed
## chr1 925053  925054  30.4468389666785
## chr1 966853  966854  26.8701605998383
## chr1 999708  999709  21.2245532255625
## chr1 1000736 1000737 24.342633538906
## chr1 1002091 1002092 25.5287170527213
##     7796 quantile0.4_summits.200bpctrl.bed
## chr1 916969  916970  36.6999257213148
## chr1 1013465 1013466 34.7022145570112
## chr1 1013780 1013781 36.2692161076724
## chr1 1021342 1021343 32.6995098507957
## chr1 1158392 1158393 36.2174294299052
##     7796 quantile0.6_summits.200bpctrl.bed
## chr1 827580  827581  45.324686100056
## chr1 1074384 1074385 48.1888265664752
## chr1 1125166 1125167 45.4661019134305
## chr1 1157947 1157948 44.1928777685044
## chr1 3587655 3587656 51.9193632691721
##     7796 quantile0.8_summits.200bpctrl.bed
## chr1 1074022 1074023 56.7025480761926
## chr1 5637535 5637536 63.6099238799638
## chr1 7019883 7019884 72.4024868144889
## chr1 7266389 7266390 70.6631991025767
## chr1 7494412 7494413 75.1725145398337
##     7796 quantile1_summits.200bpctrl.bed
## chr1 5598387 5598388 271.931218584276
## chr1 8017860 8017861 676.313615413536
## chr1 8020378 8020379 324.955493506146
## chr1 8055412 8055413 111.558738286523
## chr1 8061675 8061676 98.519268615125

Input2 -b: concatenated (plus and minus) GAT coordinates on full hg38 use read size==30

wc -l hg38.3.3.3.30_plus_minus_GAT.bed #64243047
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   13315   13318   14  14  -   GAT
#chr1   19797   19800   14  14  -   GAT

load files

load files contains 3mer coordinates info

#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
#    V1    V2    V3 V4 V5 V6  V7
#1 chr1 11145 11148 14 14  - GAT
#2 chr1 11160 11163 14 14  - GAT
#3 chr1 11576 11579 14 14  - GAT
#4 chr1 13315 13318 14 14  - GAT
#5 chr1 19797 19800 14 14  - GAT
#6 chr1 27023 27026 14 14  - GAT
tail(all.GAT.file)
#                             V1    V2    V3 V4 V5 V6  V7
#64243042 chrY_KI270740v1_random 33028 33031 36 36  + GAT
#64243043 chrY_KI270740v1_random 35533 35536 36 36  + GAT
#64243044 chrY_KI270740v1_random 35543 35546 36 36  + GAT
#64243045 chrY_KI270740v1_random 35912 35915 36 36  + GAT
#64243046 chrY_KI270740v1_random 36540 36543 36 36  + GAT
#64243047 chrY_KI270740v1_random 36913 36916 36 36  + GAT

nrow(all.GAT.file)
#[1] 64243047

bedtools closestBed

Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.

Find the closest GAT to peak summit regardless of GAT strandedness

# load closestBed function
localctrl.200.distance.quantile1=bedTools.closest(bed1 = quantile1_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
localctrl.200.distance.quantile0.8=bedTools.closest(bed1 = quantile0.8_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
localctrl.200.distance.quantile0.6=bedTools.closest(bed1 = quantile0.6_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
localctrl.200.distance.quantile0.4=bedTools.closest(bed1 = quantile0.4_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
localctrl.200.distance.quantile0.2=bedTools.closest(bed1 = quantile0.2_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')


write.table(localctrl.200.distance.quantile1,file= 'localctrl.200.distance.quantile1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(localctrl.200.distance.quantile0.8,file= 'localctrl.200.distance.quantile0.8.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(localctrl.200.distance.quantile0.6,file= 'localctrl.200.distance.quantile0.6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(localctrl.200.distance.quantile0.4,file= 'localctrl.200.distance.quantile0.4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(localctrl.200.distance.quantile0.2,file= 'localctrl.200.distance.quantile0.2.bed', quote=F,sep="\t",col.names=F,row.names=F)

save.image('231218_GATA3_ChIP_clsestbed.Rdata') 

5.2.1 prepare data

negative control 1
In previous analysis, we have used union DHS regions as negative control, and have saved the 1st closest GAT information in file “ctrl.distance.bed”.

Read in the file in R, notice that V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest GAT 3mer to each peak summit. The last column (V11) is the distance from closest motif to peak summit.

ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
head(ctrl.distance)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1  190928  190929 chr1  191000  191003 14 14  - GAT  72
## 2 chr1 1183557 1183558 chr1 1183537 1183540 14 14  - GAT  18
## 3 chr1 1207307 1207308 chr1 1207290 1207293 14 14  - GAT  15
## 4 chr1 1548961 1548962 chr1 1548964 1548967 36 36  + GAT   3
## 5 chr1 1666170 1666171 chr1 1666192 1666195 36 36  + GAT  22
## 6 chr1 2298957 2298958 chr1 2299000 2299003 36 36  + GAT  43
nrow(ctrl.distance)
## [1] 7800
nrow(ctrl.distance[ctrl.distance$V9=="+",])
## [1] 3894
nrow(ctrl.distance[ctrl.distance$V9=="-",])
## [1] 3906

negative control2
We have also just created 5 files each with the GATA3 peak summit shifted to downstream 200bp, these file will be used as a local control.

Notice that the first three column is the local control summit (200bp downstream of real peak summit), and column 4 to 10 is the 1st closest GAT coordinates found by closestbed, the last column is the distance.

for i in localctrl.200.distance.quantile*.bed
do
  wc -l $i
  head -5 $i
done
##     7594 localctrl.200.distance.quantile0.2.bed
## chr1 925053  925054  chr1    925112  925115  36  36  +   GAT 59
## chr1 966853  966854  chr1    966791  966794  14  14  -   GAT 60
## chr1 999708  999709  chr1    999712  999715  36  36  +   GAT 4
## chr1 1000736 1000737 chr1    1000812 1000815 36  36  +   GAT 76
## chr1 1002091 1002092 chr1    1002070 1002073 14  14  -   GAT 19
##     7796 localctrl.200.distance.quantile0.4.bed
## chr1 916969  916970  chr1    916863  916866  14  14  -   GAT 104
## chr1 1013465 1013466 chr1    1013519 1013522 14  14  -   GAT 54
## chr1 1013780 1013781 chr1    1013786 1013789 36  36  +   GAT 6
## chr1 1021342 1021343 chr1    1021340 1021343 14  14  -   GAT 0
## chr1 1158392 1158393 chr1    1158274 1158277 14  14  -   GAT 116
##     7796 localctrl.200.distance.quantile0.6.bed
## chr1 827580  827581  chr1    827507  827510  14  14  -   GAT 71
## chr1 1074384 1074385 chr1    1074387 1074390 14  14  -   GAT 3
## chr1 1125166 1125167 chr1    1125159 1125162 36  36  +   GAT 5
## chr1 1157947 1157948 chr1    1157924 1157927 14  14  -   GAT 21
## chr1 3587655 3587656 chr1    3587651 3587654 14  14  -   GAT 2
##     7796 localctrl.200.distance.quantile0.8.bed
## chr1 1074022 1074023 chr1    1073954 1073957 36  36  +   GAT 66
## chr1 5637535 5637536 chr1    5637600 5637603 14  14  -   GAT 65
## chr1 7019883 7019884 chr1    7019883 7019886 36  36  +   GAT 0
## chr1 7266389 7266390 chr1    7266414 7266417 36  36  +   GAT 25
## chr1 7494412 7494413 chr1    7494399 7494402 14  14  -   GAT 11
##     7796 localctrl.200.distance.quantile1.bed
## chr1 5598387 5598388 chr1    5598362 5598365 14  14  -   GAT 23
## chr1 8017860 8017861 chr1    8017864 8017867 14  14  -   GAT 4
## chr1 8020378 8020379 chr1    8020369 8020372 36  36  +   GAT 7
## chr1 8055412 8055413 chr1    8055353 8055356 14  14  -   GAT 57
## chr1 8061675 8061676 chr1    8061666 8061669 36  36  +   GAT 7

Read tne file in R with distance info of local control.

localctrl.200.distance.1=read.table('localctrl.200.distance.quantile1.bed',header=F, comment.char='')
localctrl.200.distance.0.8=read.table('localctrl.200.distance.quantile0.8.bed',header=F, comment.char='')
localctrl.200.distance.0.6=read.table('localctrl.200.distance.quantile0.6.bed',header=F, comment.char='')
localctrl.200.distance.0.4=read.table('localctrl.200.distance.quantile0.4.bed',header=F, comment.char='')
localctrl.200.distance.0.2=read.table('localctrl.200.distance.quantile0.2.bed',header=F, comment.char='')

head(localctrl.200.distance.1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598387 5598388 chr1 5598362 5598365 14 14  - GAT  23
## 2 chr1 8017860 8017861 chr1 8017864 8017867 14 14  - GAT   4
## 3 chr1 8020378 8020379 chr1 8020369 8020372 36 36  + GAT   7
## 4 chr1 8055412 8055413 chr1 8055353 8055356 14 14  - GAT  57
## 5 chr1 8061675 8061676 chr1 8061666 8061669 36 36  + GAT   7
## 6 chr1 8120991 8120992 chr1 8120993 8120996 14 14  - GAT   2
nrow(localctrl.200.distance.1)
## [1] 7796
nrow(localctrl.200.distance.1[localctrl.200.distance.1$V9=="+",])
## [1] 3855
nrow(localctrl.200.distance.1[localctrl.200.distance.1$V9=="-",])
## [1] 3941

GATA3 peaks

We have also generated 5 quantile files with 1st closest GAT to GATA3 peak summits info. These GATA3 peaks are peaks without MEME/MOTIF found motifs and have ranked by peak intensity.

Notice that V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.

all.distance.quantile1=read.table('all.distance.quantile1.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
all.distance.quantile0.8=read.table('all.distance.quantile0.8.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
all.distance.quantile0.6=read.table('all.distance.quantile0.6.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
all.distance.quantile0.4=read.table('all.distance.quantile0.4.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
all.distance.quantile0.2=read.table('all.distance.quantile0.2.bed',header=F, comment.char='')
head(all.distance.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4

combine data

GATA3 peak info + distance + status.

df.chip = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.chip) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./all.distance.quantile*.bed"))) {
    print(chip.peak)
    quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'all.distance.')[[1]][2]), ".bed")[[1]][1]
    print(quantile.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], quantile.name)
    df.chip = rbind(df.chip, all.distance)
}
## [1] "./all.distance.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./all.distance.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./all.distance.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./all.distance.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./all.distance.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame':    38778 obs. of  5 variables:
##  $ V1           : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2           : int  924853 966653 999508 1000536 1001891 1020187 1069549 1079599 1163013 1163246 ...
##  $ V3           : int  924854 966654 999509 1000537 1001892 1020188 1069550 1079600 1163014 1163247 ...
##  $ V11          : int  0 65 61 91 46 143 37 29 213 120 ...
##  $ quantile.name: chr  "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"
colnames(df.chip) = c("V1","V2","V3", "dis", "status")
head(df.chip)
##     V1      V2      V3 dis      status
## 1 chr1  924853  924854   0 quantile0.2
## 2 chr1  966653  966654  65 quantile0.2
## 3 chr1  999508  999509  61 quantile0.2
## 4 chr1 1000536 1000537  91 quantile0.2
## 5 chr1 1001891 1001892  46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2

union DHS peak info + distance +status.

ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
df.ctrl = cbind(ctrl.distance[,c(1:3, 11)], "unionDHS_ctrl")
colnames(df.ctrl) = c(colnames(ctrl.distance)[1:3], "dis", "status")
head(df.ctrl)
##     V1      V2      V3 dis        status
## 1 chr1  190928  190929  72 unionDHS_ctrl
## 2 chr1 1183557 1183558  18 unionDHS_ctrl
## 3 chr1 1207307 1207308  15 unionDHS_ctrl
## 4 chr1 1548961 1548962   3 unionDHS_ctrl
## 5 chr1 1666170 1666171  22 unionDHS_ctrl
## 6 chr1 2298957 2298958  43 unionDHS_ctrl

local control peak info + distance +status.

df.ctrl2 = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.ctrl2) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./localctrl.200.distance.quantile*.bed"))) {
    print(chip.peak)
    quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'localctrl.200.distance.')[[1]][2]), ".bed")[[1]][1]
    print(quantile.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], paste0("localctrl_200_", quantile.name)) 
    df.ctrl2 = rbind(df.ctrl2, all.distance)
}
## [1] "./localctrl.200.distance.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./localctrl.200.distance.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./localctrl.200.distance.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./localctrl.200.distance.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./localctrl.200.distance.quantile1.bed"
## [1] "quantile1"
str(df.ctrl2)
## 'data.frame':    38778 obs. of  5 variables:
##  $ V1                                     : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2                                     : int  925053 966853 999708 1000736 1002091 1020387 1069749 1079799 1163213 1163446 ...
##  $ V3                                     : int  925054 966854 999709 1000737 1002092 1020388 1069750 1079800 1163214 1163447 ...
##  $ V11                                    : int  59 60 4 76 19 54 166 34 153 26 ...
##  $ paste0("localctrl_200_", quantile.name): chr  "localctrl_200_quantile0.2" "localctrl_200_quantile0.2" "localctrl_200_quantile0.2" "localctrl_200_quantile0.2" ...
colnames(df.ctrl2) = c("V1","V2","V3", "dis", "status")
unique(df.ctrl2$status)
## [1] "localctrl_200_quantile0.2" "localctrl_200_quantile0.4"
## [3] "localctrl_200_quantile0.6" "localctrl_200_quantile0.8"
## [5] "localctrl_200_quantile1"
head(df.ctrl2)
##     V1      V2      V3 dis                    status
## 1 chr1  925053  925054  59 localctrl_200_quantile0.2
## 2 chr1  966853  966854  60 localctrl_200_quantile0.2
## 3 chr1  999708  999709   4 localctrl_200_quantile0.2
## 4 chr1 1000736 1000737  76 localctrl_200_quantile0.2
## 5 chr1 1002091 1002092  19 localctrl_200_quantile0.2
## 6 chr1 1020387 1020388  54 localctrl_200_quantile0.2

merge files.

df.all=rbind(df.chip, df.ctrl, df.ctrl2)
df.all$status = factor(df.all$status, levels = c("quantile0.2", "quantile0.4", "quantile0.6", "quantile0.8", "quantile1", "unionDHS_ctrl", "localctrl_200_quantile0.2", "localctrl_200_quantile0.4", "localctrl_200_quantile0.6", "localctrl_200_quantile0.8", "localctrl_200_quantile1" ))
head(df.all)
##     V1      V2      V3 dis      status
## 1 chr1  924853  924854   0 quantile0.2
## 2 chr1  966653  966654  65 quantile0.2
## 3 chr1  999508  999509  61 quantile0.2
## 4 chr1 1000536 1000537  91 quantile0.2
## 5 chr1 1001891 1001892  46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
nrow(df.all)
## [1] 85356

5.2.2 empirically determine binding distance for each quantile

We can empirically determine the distance at which the CDFs’ difference between GATA3 peak set and ctrl peak set reach the plateaus (where CDF traces became parallel or converging).

unique(df.all$status)
##  [1] quantile0.2               quantile0.4              
##  [3] quantile0.6               quantile0.8              
##  [5] quantile1                 unionDHS_ctrl            
##  [7] localctrl_200_quantile0.2 localctrl_200_quantile0.4
##  [9] localctrl_200_quantile0.6 localctrl_200_quantile0.8
## [11] localctrl_200_quantile1  
## 11 Levels: quantile0.2 quantile0.4 quantile0.6 quantile0.8 ... localctrl_200_quantile1

Previously, comparing between subset of unionDHS regions and GATA3 peak in quantile1, we determined the converging/parallel distance cutoff at ~13bp.

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'unionDHS_ctrl'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
match.y = seq(0, 184, by=1) # creating indices
rep.y = seq(0, 184, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 13; 
## [1] 13
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 13
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

Now that we have made local controls, we can determine the cutoff of GATA3 peak in each quantiles compared with its coresponding local ctrl:
quantile1 vs. localctrl_200_quantile1:

match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
match.y = seq(0, 184, by=1) 
rep.y = seq(0, 184, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 16
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 16
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

quantile0.8 vs. localctrl_200_quantile0.8:

match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile0.8'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.8'])
match.y = seq(0, 247, by=1) 
rep.y = seq(0, 247, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 4
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

quantile0.6 vs. localctrl_200_quantile0.6:

match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile0.6'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.6'])
match.y = seq(0, 260, by=1) 
rep.y = seq(0, 260, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 2
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

quantile0.4 vs. localctrl_200_quantile0.4:

match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile0.4'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.4'])
match.y = seq(0, 270, by=1) 
rep.y = seq(0, 270, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## numeric(0)
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

quantile0.2 vs. localctrl_200_quantile0.2:

match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile0.2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.2'])
match.y = seq(0, 360, by=1) 
rep.y = seq(0, 360, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 20
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

5.2.3 CDF

We are plotting the cumulative distribution function to evaluate whether a 3mer GAT are closer to the summit of GATA3 ChIP peaks (that might use 3mer GAT as a binding sequence) than they are to the summit of a random subset of union DHS regions (unionDHS_ctrl), as well as to the local control (summit shifted to downstream 200bp).

negative control comparison
First, we want to ask if the local control is comparable to the union DHS control?

library(lattice)
library(latticeExtra)

df.nc=rbind(df.ctrl, df.ctrl2)
df.nc$status = factor(df.nc$status, levels = c("unionDHS_ctrl", "localctrl_200_quantile0.2", "localctrl_200_quantile0.4", "localctrl_200_quantile0.6", "localctrl_200_quantile0.8", "localctrl_200_quantile1" ))

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.nc,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("blue", colorRampPalette(c("grey85","grey30"))(5)),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,2.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("blue", colorRampPalette(c("grey85","grey30"))(5)), lwd=3), strip.background=list(col="grey85"))
         )

The blue trace is the subset of union DHS regions, it mostly overlap with the local control use peak in quantile1.

single plot for each quantile

quantile1 vs. localctrl_200_quantile1 vs. union DHS control:

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile1",])
df1$status = factor(df1$status, levels = c("quantile1", "unionDHS_ctrl", "localctrl_200_quantile1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "lightblue", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 13, lty =2, col="lightblue")
             panel.abline(v= 16, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

It seems that unionDHS control is mostly overlap with local control within quantile1.
And GATA3 peak in quantile1 is clearly separated (left shift) from both control.

The light blue vertical line at 13 bp is what we empirically determined as the converging point compared between GATA3 peak in quantile 1 and unionDHS control; the grey vertical line at 16bp is what we empirically determined as the converging point compared between GATA3 peak in quantile 1 and local control.

quantile0.8 vs. localctrl_200_quantile0.8 vs. union DHS control:

library(lattice)
library(latticeExtra)

df0.8=rbind(df.chip[df.chip$status=="quantile0.8",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile0.8",])
df0.8$status = factor(df0.8$status, levels = c("quantile0.8", "unionDHS_ctrl", "localctrl_200_quantile0.8"))


ecdfplot(~abs(dis), groups = status, data = df0.8,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "lightblue", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 4, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

quantile0.6 vs. localctrl_200_quantile0.6 vs. union DHS control:

library(lattice)
library(latticeExtra)

df0.6=rbind(df.chip[df.chip$status=="quantile0.6",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile0.6",])
df0.6$status = factor(df0.6$status, levels = c("quantile0.6", "unionDHS_ctrl", "localctrl_200_quantile0.6"))


ecdfplot(~abs(dis), groups = status, data = df0.6,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "lightblue", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 2, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

quantile0.4 vs. localctrl_200_quantile0.4 vs. union DHS control:

library(lattice)
library(latticeExtra)

df0.4=rbind(df.chip[df.chip$status=="quantile0.4",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile0.4",])
df0.4$status = factor(df0.4$status, levels = c("quantile0.4", "unionDHS_ctrl", "localctrl_200_quantile0.4"))


ecdfplot(~abs(dis), groups = status, data = df0.4,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "lightblue", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 0, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

quantile0.2 vs. localctrl_200_quantile0.2 vs. union DHS control:

library(lattice)
library(latticeExtra)

df0.2=rbind(df.chip[df.chip$status=="quantile0.2",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile0.2",])
df0.2$status = factor(df0.2$status, levels = c("quantile0.2", "unionDHS_ctrl", "localctrl_200_quantile0.2"))


ecdfplot(~abs(dis), groups = status, data = df0.2,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "lightblue", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 20, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

GATA3 peaks in quantile 0.8 to 0.2 looks similar with the corresponding local control. It suggests that these peaks has no significant enrichment of 3mer GAT proximal to their summit compared to control.

all in one plot

library(lattice)
library(latticeExtra)

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "blue", colorRampPalette(c("grey85","grey30"))(5)),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,2.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "blue", colorRampPalette(c("grey85","grey30"))(5)), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 1.2, lty =2) #log10(16)=1.2
             panel.ecdfplot(...)
         })

The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (blue: random subset from the union DHS regions; grey: local control, downstream 200bp of the proginal peak summit).

The left-shift of the red curve suggests that the 3mer-GAT are closer to the GATA3 ChIP peak set in quantile1 than the ctrl peak set.

The vertical dashed line indicates at which distance (log10 of 16bp) the two CDF traces became parallel. This suggests that the 3mer-GAT often binds within 16 bp of the GATA3 peak summit.

ecdfplot(~abs(dis), groups = status, data = df.all,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "blue", colorRampPalette(c("grey85","grey30"))(5)),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "blue", colorRampPalette(c("grey85","grey30"))(5)), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 16, lty =2) 
             panel.ecdfplot(...)
         })

The difference between this CDF with the previous one is: The above one is plotting actual distance in bp, the previous one is plotting log10 of the distance. The two CDF plots are essentially communicating the same thing.

5.3 find closest 2nd GAT

In this section, we need to consider two things:
One is to define spacing/distance as the relative position of two zinc finger;
the other is to separate the cases with different strand orientation.

negative control1
In previous analysis, we used union DHS regions as the negative control for GATA3 peaks to compare against.

wc -l closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
head -5 closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
##     7800 closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
## chr1 191000  191003  14  14  -   GAT
## chr1 1183537 1183540 14  14  -   GAT
## chr1 1207290 1207293 14  14  -   GAT
## chr1 1548964 1548967 36  36  +   GAT
## chr1 1666192 1666195 36  36  +   GAT

negative control2

We have also generated 5 local controls in intensity-ranked quantiles, each with the GATA3 peak summit shifted to downstream 200bp.

Notice that the first three column is the local control summit (200bp downstream of real peak summit), and column 4 to 10 is the 1st closest GAT coordinates found by closestbed, the last column is the distance.

We can sort and select only the 3mer coordinates info to new file.

for i in localctrl.200.distance.quantile*.bed
do
  nm=$(echo $i | awk -F "localctrl.200.distance." '{print $2}' | awk -F".bed" '{print $1}')
  echo $nm
  awk '{print $4, $5, $6, $7, $8, $9, $10}' $i | awk '{$1=$1}1' OFS="\t" | sort | uniq > closest_1st_GAT_to_localctrl_200_uniq_${nm}.sorted.bed
  wc -l closest_1st_GAT_to_localctrl_200_uniq_${nm}.sorted.bed
  head -5 closest_1st_GAT_to_localctrl_200_uniq_${nm}.sorted.bed
done
## quantile0.2
##     7569 closest_1st_GAT_to_localctrl_200_uniq_quantile0.2.sorted.bed
## chr1 1000812 1000815 36  36  +   GAT
## chr1 1002070 1002073 14  14  -   GAT
## chr1 100351698   100351701   14  14  -   GAT
## chr1 10035708    10035711    36  36  +   GAT
## chr1 101026001   101026004   36  36  +   GAT
## quantile0.4
##     7791 closest_1st_GAT_to_localctrl_200_uniq_quantile0.4.sorted.bed
## chr1 100249990   100249993   36  36  +   GAT
## chr1 101004440   101004443   14  14  -   GAT
## chr1 1013519 1013522 14  14  -   GAT
## chr1 1013786 1013789 36  36  +   GAT
## chr1 10210411    10210414    36  36  +   GAT
## quantile0.6
##     7784 closest_1st_GAT_to_localctrl_200_uniq_quantile0.6.sorted.bed
## chr1 100038358   100038361   14  14  -   GAT
## chr1 10033297    10033300    14  14  -   GAT
## chr1 101371565   101371568   14  14  -   GAT
## chr1 102772572   102772575   36  36  +   GAT
## chr1 10599439    10599442    14  14  -   GAT
## quantile0.8
##     7781 closest_1st_GAT_to_localctrl_200_uniq_quantile0.8.sorted.bed
## chr1 10514921    10514924    36  36  +   GAT
## chr1 1073954 1073957 36  36  +   GAT
## chr1 107683428   107683431   14  14  -   GAT
## chr1 107688373   107688376   14  14  -   GAT
## chr1 107688777   107688780   14  14  -   GAT
## quantile1
##     7792 closest_1st_GAT_to_localctrl_200_uniq_quantile1.sorted.bed
## chr1 100486336   100486339   14  14  -   GAT
## chr1 10083454    10083457    36  36  +   GAT
## chr1 101798210   101798213   36  36  +   GAT
## chr1 102941895   102941898   14  14  -   GAT
## chr1 10511265    10511268    14  14  -   GAT

GATA3 peaks
In previous analysis we have called the closest GAT to GATA3 peaks (without MEME/STREME motifs) and saved them in 5 quantile files ranked by their intensity.

for i in closest_1st_GAT_to_GATA_uniq_peak_quantile*.bed
do
  wc -l $i
  head -5 $i
done
##     7563 closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.bed
## chr1 1000627 1000630 36  36  +   GAT
## chr1 1001937 1001940 36  36  +   GAT
## chr1 100351435   100351438   36  36  +   GAT
## chr1 10035675    10035678    36  36  +   GAT
## chr1 101025693   101025696   14  14  -   GAT
##     7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.bed
## chr1 100249748   100249751   36  36  +   GAT
## chr1 101004290   101004293   14  14  -   GAT
## chr1 1013247 1013250 36  36  +   GAT
## chr1 1013584 1013587 36  36  +   GAT
## chr1 10210228    10210231    36  36  +   GAT
##     7785 closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.bed
## chr1 100038217   100038220   14  14  -   GAT
## chr1 10033076    10033079    14  14  -   GAT
## chr1 101371391   101371394   36  36  +   GAT
## chr1 102772386   102772389   36  36  +   GAT
## chr1 10599290    10599293    36  36  +   GAT
##     7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.bed
## chr1 10514718    10514721    36  36  +   GAT
## chr1 1073808 1073811 36  36  +   GAT
## chr1 107683212   107683215   36  36  +   GAT
## chr1 107688147   107688150   36  36  +   GAT
## chr1 107688516   107688519   36  36  +   GAT
##     7791 closest_1st_GAT_to_GATA_uniq_peak_quantile1.bed
## chr1 100486183   100486186   14  14  -   GAT
## chr1 10083257    10083260    36  36  +   GAT
## chr1 101797999   101798002   14  14  -   GAT
## chr1 102941741   102941744   14  14  -   GAT
## chr1 10511140    10511143    36  36  +   GAT

5.3.1 PDF

5.4 ENCODE DHS in MCF7 cells

Stam_MCF-7_1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736581
Stam_MCF-7_2: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736588

GSE29692: “Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, isolating DNaseI ‘double-hit’ fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (36 bp reads). Uniquely mapping high-quality reads were mapped to the genome. DNaseI sensitivity is directly reflected in raw tag density (Raw Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI sensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). 1.0% false discovery rate thresholds (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within FDR 1.0% hypersensitive zones using a peak-finding algorithm.”

Notice that these data are using hg19 genome, we need to use UCSC liftover to convert data to hg38.

5.5 positive control

There are 9 GATA3 peak sets each with 1 known enriched GATA3-like motif found by MEME/STREME.

5.6 loop thourgh all bigWigs

sys.glob()